
gpertea / stringtie


Transcript assembly and quantification for RNA-Seq

License: MIT License


stringtie's Introduction


StringTie: efficient transcript assembly and quantitation of RNA-Seq data

StringTie employs efficient algorithms for transcript structure recovery and abundance estimation from bulk RNA-Seq reads aligned to a reference genome. It takes as input spliced alignments in coordinate-sorted SAM/BAM/CRAM format and produces GTF output consisting of assembled transcript structures and their estimated expression levels (FPKM/TPM and base coverage values).

For additional StringTie documentation and the latest official source and binary packages please refer to the official website: https://ccb.jhu.edu/software/stringtie

Obtaining and installing StringTie

Source and binary packages for this software can be directly downloaded from the Releases page for this repository. StringTie is compatible with a wide range of Linux and Apple macOS systems. The main program (StringTie) has no library dependencies other than zlib; compiling it from source requires a C++ compiler that supports the C++11 standard (GCC 4.8 or newer).

Building the latest version from the repository

In order to compile the StringTie source in this GitHub repository the following steps can be taken:

git clone https://github.com/gpertea/stringtie
cd stringtie
make release

During the first run of the above make command a few library dependencies will be downloaded and compiled, but any subsequent stringtie updates (using git pull) should rebuild much faster.

To complete the installation, the resulting stringtie binary can then be copied to a programs directory of choice (preferably one that is in the current shell's PATH).

Building and installing StringTie this way should take less than a minute on a regular Linux or Apple macOS desktop.

Note that simply running make produces a less optimized executable that is suitable for debugging and runtime checking but significantly slower than the optimized version built with the make release command shown above.

Using pre-compiled (binary) releases

Instead of compiling from source, some users may prefer to download an already compiled binary for Linux or Apple macOS, ready to run. These binary package releases are compiled on older versions of these operating systems in order to provide compatibility with a wide range of OS versions, not just the most recent distributions. The precompiled packages are available on the Releases page for this repository. Please note that these binary packages do not include the optional super-reads module, which currently can only be built on Linux machines from the source available in this repository.

Running StringTie

The generic command line for the default usage has this format:

stringtie [-o <output.gtf>] [other_options] <read_alignments.bam> 

The main input of the program (<read_alignments.bam>) must be a SAM, BAM or CRAM file with RNA-Seq read alignments sorted by their genomic location (for example the accepted_hits.bam file produced by TopHat, or HISAT2 output sorted with samtools sort etc.).

The main output is a GTF file containing the structural definitions of the transcripts assembled by StringTie from the read alignment data. The name of the output file should be specified with the -o option. If this -o option is not used, the output GTF with the assembled transcripts will be printed to the standard output (and can be captured into a file using the > output redirect operator).

Note: if the --mix option is used, StringTie expects two alignment files to be given as positional parameters, in a specific order: the short read alignments must be the first file given while the long read alignments must be the second input file. Both alignment files must be sorted by genomic location.

stringtie [-o <output.gtf>] --mix [other_options] <short_read_alns.bam> <long_read_alns.bam> 

Note that the command line parser in StringTie allows arbitrary order and mixing of the positional parameters with the other options of the program, so the input alignment files can also precede or be given in between the other options -- the following command line is equivalent to the one above:

stringtie <short_read_alns.bam> <long_read_alns.bam> --mix [other_options] [-o <output.gtf>] 
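For scripted runs, the command lines above can be assembled programmatically. The following is a minimal sketch, not part of StringTie itself: the helper name stringtie_command and its parameters are hypothetical, and only options listed in this README (-o, -G, -p, --mix) are used. It enforces the rule that --mix takes exactly two alignment files, short reads first.

```python
import shlex


def stringtie_command(alignments, output_gtf=None, guide=None, threads=1, mix=False):
    """Build an argv list for a StringTie run (hypothetical helper).

    alignments: one file by default, or [short_reads, long_reads] with mix=True.
    """
    alignments = list(alignments)
    expected = 2 if mix else 1
    if len(alignments) != expected:
        raise ValueError(
            f"expected {expected} alignment file(s), got {len(alignments)}"
        )
    cmd = ["stringtie"]
    if mix:
        cmd.append("--mix")
    if guide:
        cmd += ["-G", guide]
    if output_gtf:
        cmd += ["-o", output_gtf]
    if threads > 1:
        cmd += ["-p", str(threads)]
    # Positional parameters go last here; with --mix the short-read file
    # must come before the long-read file.
    cmd += alignments
    return cmd


if __name__ == "__main__":
    argv = stringtie_command(
        ["mix_short.bam", "mix_long.bam"], "mix_reads.out.gtf", mix=True
    )
    print(shlex.join(argv))
```

The returned list can be passed to subprocess.run(argv, check=True) once the stringtie binary is on the PATH.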

Running StringTie on the provided test/demo data

When building from this source repository, after the program was compiled with make release as instructed above, the generated binary can be tested on a small data set with a command like this:

make test

This will run the included run_tests.sh script which downloads a small test data set and runs a few simple tests to ensure that the program works and generates the expected output.

If a pre-compiled package is used instead of compiling the program from source, the run_tests.sh script is included in the binary package as well and it can be run immediately after unpacking the binary package:

tar -xvzf stringtie-2.2.0.Linux_x86_64.tar.gz
cd stringtie-2.2.0.Linux_x86_64
./run_tests.sh

These small test/demo data sets can also be downloaded separately as test_data.tar.gz along with the source package and pre-compiled packages on the Releases page of this repository.

The tests can also be run manually as shown below (after changing to the test_data directory, cd test_data):

Test 1: Input consists of only alignments of short reads

stringtie -o short_reads.out.gtf short_reads.bam

Test 2: Input consists of alignments of short reads and superreads

stringtie -o short_reads_and_superreads.out.gtf short_reads_and_superreads.bam

Test 3: Input consists of alignments of long reads

stringtie -L -o long_reads.out.gtf long_reads.bam

Test 4: Input consists of alignments of long reads and reference annotation (guides)

stringtie -L -G human-chr19_P.gff -o long_reads_guided.out.gtf long_reads.bam

Test 5: Input consists of alignments of short reads and alignments of long reads (using --mix option)

stringtie --mix -o mix_reads.out.gtf mix_short.bam mix_long.bam

Test 6: Input consists of alignments of short reads, alignments of long reads and a reference annotation (guides)

stringtie --mix -G mix_guides.gff -o mix_reads_guided.out.gtf mix_short.bam mix_long.bam

These tests should complete in several seconds.

For large data sets one can expect up to one hour of processing time. A minimum of 8GB of RAM is recommended for running StringTie on regular size RNA-Seq samples, with 16 GB or more being strongly advised for larger data sets.

StringTie options

The following optional parameters can be specified (use -h or --help to get the usage message):

 --mix : both short and long read data alignments are provided
        (long read alignments must be the 2nd BAM/CRAM input file)
 --rf : assume stranded library fr-firststrand
 --fr : assume stranded library fr-secondstrand
 -G reference annotation to use for guiding the assembly process (GTF/GFF)
 --conservative : conservative transcript assembly, same as -t -c 1.5 -f 0.05
 --ptf : load point-features from a given 4 column feature file <f_tab>
 -o output path/file name for the assembled transcripts GTF (default: stdout)
 -l name prefix for output transcripts (default: STRG)
 -f minimum isoform fraction (default: 0.01)
 -L long reads processing; also enforces -s 1.5 -g 0 (default:false)
 -R if long reads are provided, just clean and collapse the reads but
    do not assemble
 -m minimum assembled transcript length (default: 200)
 -a minimum anchor length for junctions (default: 10)
 -j minimum junction coverage (default: 1)
 -t disable trimming of predicted transcripts based on coverage
    (default: coverage trimming is enabled)
 -c minimum reads per bp coverage to consider for multi-exon transcript
    (default: 1)
 -s minimum reads per bp coverage to consider for single-exon transcript
    (default: 4.75)
 -v verbose (log bundle processing details)
 -g maximum gap allowed between read mappings (default: 50)
 -M fraction of bundle allowed to be covered by multi-hit reads (default:1)
 -p number of threads (CPUs) to use (default: 1)
 -A gene abundance estimation output file
 -E define window around possibly erroneous splice sites from long reads to
    look out for correct splice sites (default: 25)
 -B enable output of Ballgown table files which will be created in the
    same directory as the output GTF (requires -G, -o recommended)
 -b enable output of Ballgown table files but these files will be 
    created under the directory path given as <dir_path>
 -e only estimate the abundance of given reference transcripts (requires -G)
 --viral : only relevant for long reads from viral data where splice sites
    do not follow consensus (default:false)
 -x do not assemble any transcripts on the given reference sequence(s)
 -u no multi-mapping correction (default: correction enabled)
 --ref/--cram-ref reference genome FASTA file for CRAM input

Transcript merge usage mode: 

  stringtie --merge [Options] { gtf_list | strg1.gtf ...}
With this option StringTie will assemble transcripts from multiple
input files generating a unified non-redundant set of isoforms. In this mode
the following options are available:
  -G <guide_gff>   reference annotation to include in the merging (GTF/GFF3)
  -o <out_gtf>     output file name for the merged transcripts GTF
                    (default: stdout)
  -m <min_len>     minimum input transcript length to include in the merge
                    (default: 50)
  -c <min_cov>     minimum input transcript coverage to include in the merge
                    (default: 0)
  -F <min_fpkm>    minimum input transcript FPKM to include in the merge
                    (default: 1.0)
  -T <min_tpm>     minimum input transcript TPM to include in the merge
                    (default: 1.0)
  -f <min_iso>     minimum isoform fraction (default: 0.01)
  -g <gap_len>     gap between transcripts to merge together (default: 250)
  -i               keep merged transcripts with retained introns; by default
                   these are not kept unless there is strong evidence for them
  -l <label>       name prefix for output transcripts (default: MSTRG)

More details about StringTie options can be found in the online manual.
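As an illustration of the merge mode, the sketch below writes a gtf_list file and builds the corresponding --merge command line. The helper names are hypothetical, and the list file layout (one GTF path per line) is an assumption that should be checked against the online manual.

```python
from pathlib import Path


def write_gtf_list(gtf_paths, list_path):
    """Write one GTF path per line (assumed layout of the gtf_list file)."""
    Path(list_path).write_text("".join(f"{p}\n" for p in gtf_paths))
    return str(list_path)


def merge_command(gtf_list, output_gtf, guide=None, min_tpm=None):
    """Build an argv list for 'stringtie --merge' using options listed above."""
    cmd = ["stringtie", "--merge"]
    if guide:
        cmd += ["-G", guide]
    if min_tpm is not None:
        cmd += ["-T", str(min_tpm)]
    cmd += ["-o", output_gtf, gtf_list]
    return cmd
```

For example, merge_command("gtf_list.txt", "merged.gtf", guide="ref.gff") yields the arguments for a guided merge of all assemblies named in gtf_list.txt.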

Input files

StringTie takes as input a SAM, BAM or CRAM file sorted by coordinate (genomic location). This file should contain spliced RNA-seq read alignments such as the ones produced by TopHat or HISAT2. TopHat output is already sorted. Unsorted SAM or BAM files generated by other aligners should be sorted using the samtools program:

samtools sort -o alns.sorted.bam alns.sam

The file resulting from the above command (alns.sorted.bam) can be used as input to StringTie.
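Before starting a long run it can be useful to confirm that a file declares coordinate sorting in its header. The sketch below is a hypothetical helper that inspects the SO field of the @HD header line, as defined by the SAM format; it only reads what the header claims and does not verify the actual record order.

```python
def sam_sort_order(header_text):
    """Return the SO value from the @HD header line, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            return None
    return None


def is_coordinate_sorted(header_text):
    """True only if the header declares SO:coordinate."""
    return sam_sort_order(header_text) == "coordinate"
```

The header text can be obtained with samtools view -H alns.sorted.bam and passed to is_coordinate_sorted.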

Any SAM record with a spliced alignment (i.e. having a read alignment across at least one junction) should have the XS tag to indicate the transcription strand, i.e. the genomic strand from which the RNA that produced this read originated. TopHat and HISAT2 alignments already include this tag, but if you use a different read mapper you should check that this tag is also included for spliced alignment records. STAR aligner should be run with the option --outSAMstrandField intronMotif in order to generate this tag.

The XS tags are not necessary in the case of long RNA-seq reads aligned with minimap2 using the -ax splice option. minimap2 adds the ts tags to splice alignments to indicate the transcription strand (albeit in a different manner than the XS tag), and StringTie can recognize the ts tag as well, if the XS tag is missing. Thus the long read spliced alignments produced by minimap2 can be also assembled by StringTie (with the option -L or as the 2nd input file for the --mix option).
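The strand-tag requirement can be checked on SAM text before assembly. This is a minimal sketch under simple assumptions: a record counts as spliced when its CIGAR string contains an N operation, and it is reported when it carries neither an XS nor a ts tag. The function name is hypothetical.

```python
def spliced_without_strand(sam_lines):
    """Return read names of spliced alignments lacking both XS and ts tags."""
    missing = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        cigar = fields[5]
        if "N" not in cigar:
            continue  # unspliced alignment, no strand tag required here
        tags = {f[:2] for f in fields[11:]}
        if "XS" not in tags and "ts" not in tags:
            missing.append(fields[0])
    return missing
```

The input lines can come from samtools view on the alignment file; an empty result means every spliced record has one of the two tags.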

As explained above, the alignments must be sorted by coordinate before they can be used as input for StringTie.

When CRAM files are used as input, the original reference genomic sequence can be provided with the --ref (--cram-ref) option as a multi-FASTA file with the same chromosome sequences that were used when aligning the reads. This is optional but recommended because StringTie can better estimate the quality of some spliced alignments (e.g. noticing mismatches around junctions) and that data can be retrieved in the case of some CRAM files only when the reference genome sequence is also provided.

Reference transcripts (guides)

A reference annotation file in GTF or GFF3 format can be provided to StringTie with the -G option; the transcripts in this file are then used as 'guides' for the assembly process.

When the -e option is used (i.e. expression estimation only), the -G option is required. In that case StringTie will not attempt to assemble the read alignments; it will only estimate the expression levels of the transcripts provided in this file.

Note that when a reference transcript is fully covered by reads, the original transcript ID from the reference annotation file will be shown in StringTie's output record in the reference_id GTF attribute. Output transcripts that lack such reference_id attribute can be considered "novel" transcript structures with respect to the given reference annotation.
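The reference_id rule described above can be applied to an output GTF with a few lines of parsing. The following sketch uses hypothetical function names and assumes the usual GTF layout (nine tab-separated columns, attributes written as key "value";). It splits transcript records into those that carry a reference_id and those that do not.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Turn the ninth GTF column into a dict of attribute names to values."""
    return dict(ATTR_RE.findall(attr_field))


def split_by_reference(gtf_lines):
    """Return (known, novel): known maps transcript_id to reference_id,
    novel lists transcript_ids that have no reference_id attribute."""
    known, novel = {}, []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        tid = attrs.get("transcript_id")
        if "reference_id" in attrs:
            known[tid] = attrs["reference_id"]
        else:
            novel.append(tid)
    return known, novel
```

Calling split_by_reference(open("short_reads.out.gtf")) on a guided run would then list the candidate novel transcripts in the second return value.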

The super-reads module

This optional module can be used to de-novo assemble, align and pre-process RNA-Seq reads, preparing them to be used as "super-reads" by StringTie.

More usage information is provided in SuperReads_RNA/README.md. Quick installation instructions for this module from the source available in this repository (assuming the main StringTie installation was already completed as described above):

 cd SuperReads_RNA
 ./install.sh

Using super-reads with StringTie

After running the super-reads module (see the SuperReads_RNA module documentation for usage details), a BAM file called sr_merge.bam is created in the selected output directory; it contains sorted alignments for both short reads and super-reads. This file can be given directly as the main input file to StringTie, as described in the Running StringTie section above.

License

StringTie is free, open source software released under an MIT License.

Publications

Shumate A, Wong B, Pertea G, Pertea M Improved transcriptome assembly using a hybrid of long and short reads with StringTie, PLOS Computational Biology 18, 6 (2022), doi:10.1371/journal.pcbi.1009730

Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M Transcriptome assembly from long-read RNA-seq alignments with StringTie2, Genome Biology 20, 278 (2019), doi:10.1186/s13059-019-1910-1

Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown, Nature Protocols 11, 1650-1667 (2016), doi:10.1038/nprot.2016.095

Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT & Salzberg SL StringTie enables improved reconstruction of a transcriptome from RNA-seq reads, Nature Biotechnology 2015, doi:10.1038/nbt.3122

stringtie's People

Contributors

gpertea, gsatas, ljyanesm, mpertea, nsoranzo, razsultana, smoe


stringtie's Issues

Segmentation fault with annotation file

Hi, I'm trying to run stringtie (v1.0.2) with an annotation GFF file, but every time I run it, it throws a segmentation fault (core dumped) and quits. Without the annotation GFF file, stringtie seems to run fine. Is this a bug, or might there be something wrong with my annotation file?

I want to compare the stringtie output with cufflinks and because I run cufflinks with annotation, I want to run stringtie with an annotation file as well.

Thanks,
Peter

Error at GBitVec: index out of bounds (Stringtie version 1.2.2)

Hi,

This issue has been reported and closed in #26 and #32, but we are currently still running into the same error with Stringtie version 1.2.2.

[04/18 16:52:28]>bundle ST4.03ch09:61229862-61324206(2719458) (0 guides) loaded, begins processing...
[04/18 16:52:28]Error at GBitVec: index 480460105 out of bounds (size 480460105)

The BAM file was generated with Hisat version 2.0.1-beta with the --dta and --rna-strandness options. We then run Stringtie (using ~16Gb of RAM) with the following command:
$stringtie $sample.bam -p $threads -o $sample.gtf

I've extracted the regions affected (ST4.03ch09:61229862-61324206) into a smaller BAM file for your reference.

Cheers.

Error correction of PE reads failed. Check pe.cor.log.

Hi,
I am trying to use the superreads.pl script as follows:
superreads.pl /input/R1.fastq.gz /input/R2.fastq.gz /opt/MaSuRCA-3.1.3 -l /output/LongReads.fq -t 25 -j 30000000000

Unfortunately, the process fails with the message "Check pe.cor.log.", but that file does not exist.

I am running the app inside a Docker container, the same approach I use for other memory-hungry apps. The machine has 128 GB of RAM.

from shell:
root@572d1bf4480d:/output# ls -l
total 62142048
-rw-r--r-- 1 root root 14683831229 Sep 15 14:47 R1.fastq.gz
-rw-r--r-- 1 root root 14683831229 Sep 15 14:50 R2.fastq.gz
-rwxr-xr-x 1 root root 6838 Sep 15 14:50 assemble.sh
-rw-r--r-- 1 root root 274 Sep 15 14:50 environment.sh
-rw-r--r-- 1 root root 10 Sep 15 14:50 meanAndStdevByPrefix.pe.txt
-rw-r--r-- 1 root root 25145047806 Sep 15 14:32 pe.renamed.fastq
-rw-r--r-- 1 root root 2520000 Sep 15 14:50 pe_data.tmp
-rw-r--r-- 1 root root 168 Sep 15 14:50 quorum.err
-rw-r--r-- 1 root root 0 Sep 15 14:32 quorum_mer_db.jf
-rw-r--r-- 1 root root 2054 Sep 15 14:50 sr_config.txt

from "environment.sh"

PATH="/opt/MaSuRCA-3.1.3/bin:/opt/MaSuRCA-3.1.3/bin/../CA/Linux-amd64/bin:/opt/MaSuRCA-3.1.3/bin:/opt/jellyfish-2.2.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
NUM_THREADS="20"
PE_AVG_READ_LENGTH="125"
KMER="87"
MIN_Q_CHAR="33"
JF_SIZE="30000000000"

from stdout

root@572d1bf4480d:/output# superreads.pl /input/R1.fastq.gz /input/R2.fastq.gz /opt/MaSuRCA-3.1.3 -l /output/LongReads.fq -t 20 -j 30000000000
Starting step 1: process input files files....
Done step 1.
Starting step 2: prepare files for assembly....
Running /opt/MaSuRCA-3.1.3/bin/masurca sr_config.txt
Verifying PATHS...
jellyfish OK
runCA OK
createSuperReadsForDirectory.perl OK
creating script file for the actions...done.
execute assemble.sh to run assembly
Done step 2.
Starting step 3: run MaSuRCA super-read module....
[Tue Sep 15 14:50:41 UTC 2015] Processing pe library reads
Average PE read length 125
choosing kmer size of 87 for the graph
MIN_Q_CHAR: 33
[Tue Sep 15 14:50:43 UTC 2015] Error correct PE.
./assemble.sh: line 96: 73 Aborted (core dumped) quorum_error_correct_reads -q $((MIN_Q_CHAR + 40)) --contaminant=/opt/MaSuRCA-3.1.3/bin/../share/adapter.jf -m 1 -s 1 -g 1 -a 3 -t 20 -w 10 -e 3 quorum_mer_db.jf pe.renamed.fastq --no-discard -o pe.cor --verbose > quorum.err 2>&1
[Tue Sep 15 14:50:43 UTC 2015] Error correction of PE reads failed. Check pe.cor.log.

Error at GBitVec: index out of bounds

I'm getting this error message when running stringtie V1.1.1 in the "-G" regime.
Error at GBitVec: index 391265981 out of bounds (size 391265970)

Any ideas?

exon length is ZERO!

Hi, guys

Did you ever notice that in StringTie's GTF output the lengths of some exons are equal to 0?

keep track of raw read counts per locus (gene)

These per-locus (gene) read counts should always be available internally but when Ballgown output is requested they should also be written to these additional files:

  • g_data.ctab (structure similar to t_data.ctab, with a g_id assigned to each gene /locus)
  • t2g.ctab (linking t_id to g_id)

Killed by kernel due to memory exhaustion (32GB Memory + 32GB Swap)

I am using StringTie 1.0.2 on an Ubuntu server with 32 GB of memory, 32 GB of swap and an 8-core CPU. When I process a BAM file generated by HISAT, the stringtie process is always killed by the Linux kernel after exhausting the 32 GB of memory. I checked the log file and found that it always stops here:
[04/04 22:09:10]>bundle chr7:46443159-46919933(2081294) (2027js, 19 guides) loaded, begins processing...
While this bundle is processed, memory and swap usage always climb to 100%. It seems that 2081294 is the number of reads, and I guessed that the -s option might affect memory usage, so I changed -s from its default to -s 500000, but that didn't help.
Can you help me figure out what causes the problem and how to fix it?
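When a run stalls on one bundle, the verbose (-v) log can be scanned for the bundles with the largest counts. The sketch below is an illustration only: the function name is hypothetical, the pattern is derived from the ">bundle" log lines quoted in these reports, and the reading of the parenthesized number as a read count is the poster's guess rather than documented behavior.

```python
import re

# Example line:
# [04/04 22:09:10]>bundle chr7:46443159-46919933(2081294) (2027js, 19 guides) loaded, ...
BUNDLE_RE = re.compile(r">bundle (\S+):(\d+)-(\d+)\((\d+)\)")


def largest_bundles(log_lines, top=3):
    """Return up to `top` tuples (count, seq, start, end), largest count first."""
    bundles = []
    for line in log_lines:
        m = BUNDLE_RE.search(line)
        if m:
            seq = m.group(1)
            start, end, count = int(m.group(2)), int(m.group(3)), int(m.group(4))
            bundles.append((count, seq, start, end))
    bundles.sort(reverse=True)
    return bundles[:top]
```

The resulting regions can then be extracted into a smaller BAM file for inspection or for attaching to a bug report.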

gene id from reference GTF not shown in the output file

Hi,

I'm using StringTie and it runs smoothly and quite fast on my data, however, when I look at the output file I don't see the gene_ids that were in the GTF file, instead I see the following pattern:

.... gene_id "samplename.1"; ...
.... gene_id "samplename.2"; ...
.... gene_id "samplename.3"; ...
etc etc

Only the reference ID is preserved, so I have to look up each gene_id in the GTF file and then search for the corresponding reference ID in the StringTie output. That is not practical, since I would like to look at the FPKM values before proceeding with the differential analysis.

I'm using the following command line to run StringTie:

stringtie sample1.sorted.bam -e -p 4 -G GTF/dm3_transcriptome_annotation.gtf -l sample1 -o sample1

Stringtie only processed chr1

Hi,

I ran stringtie v1.0.4 successfully on all of my samples except for one. For this sample, stringtie only processed chr1 and exited without any error. There's nothing particularly different about this sample. I am able to run cufflinks on all samples, including the one that stringtie has a problem with. I'm wondering if you can help me figure this out. I don't see any error message anywhere.

Here's how I ran stringtie:
stringtie mapped.bam -G genes.gtf -o output.gtf -p 4 -m 50 -C output.cov -B -c 0.001 -v

Here's the log for that particular run
[06/29 15:07:48] Loading reference annotation (guides)..
GFF warning: merging adjacent/overlapping segments of NM_001267620 on chr1 (75188734-75189163, 75189165-75189252)
GFF warning: merging adjacent/overlapping segments of NM_026187 on chr1 (75188734-75189163, 75189165-75189252)
....[truncated]
[06/29 15:07:58]>bundle chr1:16091376-16094514(21295) (3js, 1 guides) loaded, begins processing...
[06/29 15:07:58] 258732 aligned fragments found.
[06/29 15:07:59]^bundle chr1:16091376-16094514(21295) done (3 processed potential transcripts).
[06/29 15:07:59] All threads finished.
Total count of aligned fragments: 258732
Average fragment length:196.838

Thanks,
Tisha

StringTie not creating output dir automatically

Hi,

I ran StringTie interactively (qrsh) on the JHPCE cluster and it would create the output directory but hang at that point. When I tried running StringTie via submitted jobs (qsub) the job would fail because the output directory was not created by StringTie. From the docs, I understand that StringTie does this step. Anyhow, simply creating the output directory before running StringTie solves the problem.

So, if you just don't want to support this small issue, simply update the manual.

Here is an example of the type of error I was getting (via qsub):

Error creating output file
/dcs01/ajaffe/Brain/derRuns/derSupplement/simulation/stringtie/sample1G1R1/outfile.gtf.tmp

where /dcs01/ajaffe/Brain/derRuns/derSupplement/simulation/stringtie existed already, but not the subdirectory sample1G1R1.

This is with StringTie version 1.1.2.

Best,
Leo
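The workaround described in this report (creating the output directory before the run) is easy to script. A minimal sketch, with a hypothetical helper name:

```python
import os


def ensure_parent_dir(output_path):
    """Create the directory that will hold output_path if it does not exist yet."""
    parent = os.path.dirname(os.path.abspath(output_path))
    os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return parent
```

Calling ensure_parent_dir on the path that will be given to -o, before launching stringtie, avoids the "Error creating output file" failure regardless of whether the installed version creates the directory itself.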

manual issue: does hisat2 support on the fly passing of --ss and --exon?

In the Stringtie manual it says this:

Our recommended workflow includes the following steps:
for each RNA-Seq sample, map the reads to the genome with HISAT2 using the --dta option. It is highly recommended to use the reference annotation information when mapping the reads, which can be either embedded in the genome index (built with the -ss and --exon options, see HISAT2 manual), or provided separately at run time (-ss and --exon options of HISAT2). The SAM output of each HISAT2 must be sorted and converted to BAM using samtools as explained above.

but there doesn't seem to be an option to pass --ss and --exon in hisat2 in either the manual or the help in version 2.0.1.

Majority of FPKMs are zero

Hi,

I am using the NCBI GRCh38 assembly, with chromosome names of the form NC_.xxx, to which I aligned my RNA-Seq data (101 bp paired-end). I used the annotation GFF with the same chromosome IDs, but with the -b option all my FPKM values are zero. I was wondering if it has anything to do with the annotation file?

Srikanth

Stringtie error

Hi,

I am running stringtie with gencode v22 annotation on one of my samples of human RNA-Seq data from a cell line.
The command I am using:

stringtie JH_04.sorted.bam -G Gencode.v22.annotation.gtf -p 12 -l JH_04 -b read_counts_JH_04 -o Assembled_transcripts_JH_04.gtf
It throws warnings that read saturation was reached for some locations on ChrM. Then, after 2-3 days of running and using all available RAM, the process is killed. I do not know what the issue is.

I have tried anywhere from a single core up to a maximum of 20 cores; none of the runs finishes, and all are aborted/killed automatically. The maximum memory used is 200G. I am wondering if there is a memory leak or some other issue.

Any help is appreciated.

Srikanth

StringTie model extends beyond edge of reference

Hi,

I used hisat2 to map a few sets of RNA-seq reads to my reference genome, then used stringtie with all defaults to assemble, then used stringtie --merge to combine assemblies from different sets. I tried extracting nucleotide sequences for the merged assembly from my reference genome and found that the coordinates of at least one StringTie model went beyond the edge of the reference: the model coords are 13646-16322 but the scaffold length is only 16303. I thought maybe there was a StringTie setting I could turn on that regulates this behavior but I couldn't find one. I couldn't find a setting in hisat2 either but maybe that's where I need to look more, what do you think?

Thanks!
Matt
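Features that extend past the end of a reference sequence, as in this report, can be listed with a simple check against known sequence lengths. The sketch below assumes standard GTF columns (sequence name in column 1, start and end in columns 4 and 5); the function name and the scaffold name in the usage note are hypothetical.

```python
def features_past_end(gtf_lines, seq_lengths):
    """Return (seq, feature, start, end, seq_length) for every GTF feature
    whose end coordinate exceeds the length of its reference sequence."""
    offenders = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        seq, start, end = cols[0], int(cols[3]), int(cols[4])
        length = seq_lengths.get(seq)
        if length is not None and end > length:
            offenders.append((seq, cols[2], start, end, length))
    return offenders
```

With seq_lengths = {"scaffold_1": 16303}, a transcript spanning 13646-16322 on scaffold_1 would be reported. The lengths can be read from the first two columns of a samtools faidx index (.fai) file.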

Missing strand information on single exon transcripts

Hello,

I just started running StringTie recently using strand-specific RNA-Seq data aligned using GSNAP, and I noticed that the assembled monoexonic transcripts lack the strand information. Since GSNAP output provides the XS tag info only on the spliced alignments, I would like to know if the issue is due to the lack of this tag on the unspliced reads.

It would be great if you can tell me if this can be fixed by manually adding the XS tag also on the unspliced reads or if I need to do anything else.

Thanks a lot!

Not fully output all reference transcripts with annotation file

Hi, I'm using StringTie to quantify the expression of reference transcripts with an annotation file (UCSC GTF). However, only ~26000 reference transcripts were output (out of 99900 in total). I guess this may be due to the -c parameter. Confusingly, when I set "-c 0.0" there is a "Segmentation fault (core dumped)" error; when I set "-c 0.1" there is no error, but the number of output transcripts remains ~26000, nearly the same as with "-c 2.5". I want to obtain expression values for all reference transcripts (even if they are 0). Can you tell me how to do that with StringTie? And why does adjusting the -c parameter give the confusing results described above?

Thank you very much
Charlin

Stringtie --merge

So I compared the new --merge option with cuffmerge on the same set of GTF files.
Here is what I get:

Hisat 2.0.0 + Stringtie + Stringtie --merge
parsed genome node DAGs: 20343
sequence regions: 804 (total length: 235278766)
genes: 19537
mRNAs: 52981
exons: 310477

Hisat 2.0.0 + Stringtie + Cuffmerge
parsed genome node DAGs: 18122
sequence regions: 815 (total length: 235906968)
genes: 17307
mRNAs: 61860
exons: 460638

Any idea what is going on? Aren't they supposed to be equivalent? Is there a reason to assume that the output from StringTie is superior?

Discrepancy between gene abundance and tx abundance

Hi, I used stringtie v1.2.1 to estimate FPKM values against the Ensembl 75 GTF file on an alignment file generated with TopHat2.

Stringtie command:

stringtie accepted_hits.bam -G ~/NGS/gatk_ref/ensembl_75_hg19.gtf -p 8 -A delMeLater/fuddls1.abundance -B -e -x chrM -o delMeLater/fuddls1.op

But the values estimated for all transcripts of a gene in transcript abundance file t_data.ctab do not add up to the value estimated for the same gene in gene abundance file.

For example, here is GAPDH expression from gene.abundance

$ grep -w GAPDH fuddls1.abundance 
ENSG00000111640 GAPDH   +   6643093 6647537 3999    7632.947266 697.140625  1722.543457

These are the GAPDH transcript abundances from t_data.ctab:

$ grep -w GAPDH t_data.ctab 
36909   chr12   +   6643093 6647537 ENST00000229239 9   1875    ENSG00000111640 GAPDH   0.202568    0.018422
36910   chr12   +   6643678 6644307 ENST00000496049 2   390 ENSG00000111640 GAPDH   9.696434    0.881809
36911   chr12   +   6643698 6647506 ENST00000396856 9   1266    ENSG00000111640 GAPDH   0.271702    0.024709
36912   chr12   +   6643699 6646964 ENST00000492719 8   930 ENSG00000111640 GAPDH   29.442436   2.677542
36913   chr12   +   6643700 6647525 ENST00000396861 9   1348    ENSG00000111640 GAPDH   0.351579    0.031973
36914   chr12   +   6643707 6647480 ENST00000474249 8   1333    ENSG00000111640 GAPDH   0.214894    0.019543
36915   chr12   +   6643920 6647483 ENST00000466588 7   1363    ENSG00000111640 GAPDH   0.344162    0.031299
36916   chr12   +   6643920 6647505 ENST00000396859 8   1256    ENSG00000111640 GAPDH   0.367886    0.033456
36917   chr12   +   6643958 6647503 ENST00000466525 4   1720    ENSG00000111640 GAPDH   0.213218    0.019390
36918   chr12   +   6644468 6647481 ENST00000396858 8   1292    ENSG00000111640 GAPDH   0.357635    0.032524

The gene abundance file gives a value of 1722.543457 FPKM for GAPDH, which is what we expect, but if I add up the values of all GAPDH transcripts (last column, FPKM) from t_data.ctab, they sum to 3.77.

Is there anything I'm misinterpreting here? I also used cufflinks to estimate the same quantities; there, the GAPDH transcript values from isoforms.fpkm_tracking add up to the GAPDH gene value in genes.fpkm_tracking.

Thank you.
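The per-gene sum quoted in this report can be reproduced from the t_data.ctab rows shown above. This is a small sketch with a hypothetical function name; it assumes the 12-column layout visible in the listing, with the gene name in column 10 and FPKM in the last column.

```python
def fpkm_sum_for_gene(ctab_lines, gene_name):
    """Sum the FPKM column (12th) over all rows whose gene_name (10th) matches."""
    total = 0.0
    for line in ctab_lines:
        cols = line.split()
        if len(cols) < 12 or cols[9] != gene_name:
            continue
        total += float(cols[11])
    return total


# The GAPDH rows from the report above.
ROWS = """\
36909 chr12 + 6643093 6647537 ENST00000229239 9 1875 ENSG00000111640 GAPDH 0.202568 0.018422
36910 chr12 + 6643678 6644307 ENST00000496049 2 390 ENSG00000111640 GAPDH 9.696434 0.881809
36911 chr12 + 6643698 6647506 ENST00000396856 9 1266 ENSG00000111640 GAPDH 0.271702 0.024709
36912 chr12 + 6643699 6646964 ENST00000492719 8 930 ENSG00000111640 GAPDH 29.442436 2.677542
36913 chr12 + 6643700 6647525 ENST00000396861 9 1348 ENSG00000111640 GAPDH 0.351579 0.031973
36914 chr12 + 6643707 6647480 ENST00000474249 8 1333 ENSG00000111640 GAPDH 0.214894 0.019543
36915 chr12 + 6643920 6647483 ENST00000466588 7 1363 ENSG00000111640 GAPDH 0.344162 0.031299
36916 chr12 + 6643920 6647505 ENST00000396859 8 1256 ENSG00000111640 GAPDH 0.367886 0.033456
36917 chr12 + 6643958 6647503 ENST00000466525 4 1720 ENSG00000111640 GAPDH 0.213218 0.019390
36918 chr12 + 6644468 6647481 ENST00000396858 8 1292 ENSG00000111640 GAPDH 0.357635 0.032524
""".splitlines()

if __name__ == "__main__":
    print(f"{fpkm_sum_for_gene(ROWS, 'GAPDH'):.2f}")  # prints 3.77
```

The sketch only confirms the arithmetic in the report (3.77 versus 1722.54 in the gene abundance file); it does not explain the difference.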

Compilation on Linux Ubuntu 12.04 and 14.04

Hi,

Which dependencies are required? I am having problems with compilation.

cd stringtie-master
make release

I am guessing:
g++

I receive following error (apologies for the german).

Any ideas ?

Thanks,
Colin

bam_reheader.c:11:16: warning: variable 'old' set but not used [-Wunused-but-set-variable]
gcc -c -g -Wall -O2 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_CURSES_LIB=0 -I. kprobaln.c -o kprobaln.o
kprobaln.c: In function 'kpa_glocal':
kprobaln.c:78:21: warning: variable 'is_diff' set but not used [-Wunused-but-set-variable]
gcc -c -g -Wall -O2 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_CURSES_LIB=0 -I. bam_cat.c -o bam_cat.o
ar -csru libbam.a bgzf.o kstring.o bam_aux.o bam.o bam_import.o sam.o bam_index.o bam_pileup.o bam_lpileup.o bam_md.o razf.o faidx.o bedidx.o bam_sort.o sam_header.o bam_reheader.o kprobaln.o bam_cat.o
make[1]: Leaving directory '/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18'
g++ -g -L./samtools-0.1.18 -o stringtie gclib/GBase.o gclib/GArgs.o gclib/GStr.o gclib/GBam.o gclib/gdna.o gclib/codons.o gclib/GFaSeqGet.o gclib/gff.o gclib/GThreads.o rlink.o tablemaker.o stringtie.o -lz -lbam -lpthread
./samtools-0.1.18/libbam.a(bam_import.o): In function `ks_getuntil2':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:17: undefined reference to `gzread'
./samtools-0.1.18/libbam.a(bam_import.o): In function `__bam_get_lines':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:76: undefined reference to `gzdopen'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:92: undefined reference to `gzclose'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:76: undefined reference to `gzopen64'
./samtools-0.1.18/libbam.a(bam_import.o): In function `sam_header_read2':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:126: undefined reference to `gzdopen'
./samtools-0.1.18/libbam.a(bam_import.o): In function `ks_getc':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:17: undefined reference to `gzread'
./samtools-0.1.18/libbam.a(bam_import.o): In function `sam_header_read2':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:147: undefined reference to `gzclose'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:126: undefined reference to `gzopen64'
./samtools-0.1.18/libbam.a(bam_import.o): In function `sam_open':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:468: undefined reference to `gzdopen'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:468: undefined reference to `gzopen64'
./samtools-0.1.18/libbam.a(bam_import.o): In function `sam_close':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bam_import.c:481: undefined reference to `gzclose'
./samtools-0.1.18/libbam.a(bgzf.o): In function `deflate_block':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:311: undefined reference to `deflate'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:313: undefined reference to `deflateEnd'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:305: undefined reference to `deflateInit2_'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:329: undefined reference to `deflateEnd'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:345: undefined reference to `crc32'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:346: undefined reference to `crc32'
./samtools-0.1.18/libbam.a(bgzf.o): In function `inflate_block':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:380: undefined reference to `inflateInit2_'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:385: undefined reference to `inflate'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:391: undefined reference to `inflateEnd'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/bgzf.c:387: undefined reference to `inflateEnd'
./samtools-0.1.18/libbam.a(razf.o): In function `_razf_write':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:197: undefined reference to `deflate'
./samtools-0.1.18/libbam.a(razf.o): In function `razf_open_w':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:170: undefined reference to `deflateInit2_'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:186: undefined reference to `deflateSetHeader'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:389: undefined reference to `inflateInit2_'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:390: undefined reference to `inflateEnd'
./samtools-0.1.18/libbam.a(razf.o): In function `_razf_read':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:602: undefined reference to `inflate'
./samtools-0.1.18/libbam.a(razf.o): In function `razf_flush':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:230: undefined reference to `deflate'
./samtools-0.1.18/libbam.a(razf.o): In function `_razf_reset_read':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:709: undefined reference to `inflateReset'
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:709: undefined reference to `inflateReset'
./samtools-0.1.18/libbam.a(razf.o): In function `razf_close':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:826: undefined reference to `inflateEnd'
./samtools-0.1.18/libbam.a(razf.o): In function `razf_end_flush':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:254: undefined reference to `deflate'
./samtools-0.1.18/libbam.a(razf.o): In function `razf_close':
/home/bioinformatics/programs/stringtie/stringtie-master/samtools-0.1.18/razf.c:800: undefined reference to `deflateEnd'
collect2: ld returned 1 exit status
make: *** [stringtie] Error 1

cuffmerge replacement (--merge option?)

We should implement a "transcript merge" mode of stringtie in which it would use the splice graph to simply assemble compatible isoforms loaded from multiple GTF files, as a functional replacement for Cuffmerge. In this mode of operation stringtie would no longer expect an input BAM file but would instead take a list of GTF/GFF files and "assemble" the transcripts from those (without any attempt at abundance estimation, of course).

Multi-threading not working?

Hi, the new suite of RNA-Seq tools looks great and I'm excited to start using them.

I'm testing stringtie at the moment and it is taking longer than expected to run (~30-40 minutes for 4 million reads).
I'm using the '-p 32' flag, but the software seems to use at most 3-5% of a single CPU. Is there an issue with the multi-threading?

(Using the pre-built Linux x86_64 binary package distribution)

Support TPM

Are there any plans to support TPM in addition to FPKM as an estimator of relative abundance? I'd appreciate any single value that is comparable between samples - it doesn't have to be TPM, just free of scaling issues that depend on sequencing depth. I know that for differential expression analysis the consensus is to use raw counts and a discrete negative binomial model, but we often would like to quickly cluster samples for QC and hypothesis generation, and doing that directly from the stringtie output would be great.
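
Until such an option exists, TPM can be derived from the FPKM column after the fact, using the standard relation TPM_i = FPKM_i / sum(FPKM) * 10^6 taken over all transcripts in one sample. The sketch below shows only that arithmetic; it is not StringTie's own code, and the function name fpkm_to_tpm and the input values are made up for illustration.

```python
def fpkm_to_tpm(fpkm_values):
    """Rescale a list of FPKM values from one sample so that they sum to 1e6."""
    total = sum(fpkm_values)
    if total == 0:
        # No expression at all: avoid dividing by zero.
        return [0.0 for _ in fpkm_values]
    return [value / total * 1e6 for value in fpkm_values]


# Hypothetical FPKM values for three transcripts in a single sample.
tpm = fpkm_to_tpm([10.0, 30.0, 60.0])
print(tpm)
```

Because the values are rescaled to a fixed total per sample, the proportions are what get compared between samples, which is the property asked for above.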

minimum read coverage

I understand that the '-c' option applies to novel transcripts only, but it could be useful for all transcripts, regardless of their 'novelty'. Sometimes a certain number of mapped reads is required at a reference locus before reads there are assembled into transcripts.

high FPKM called with few reads mapping to transcript

Thanks for making Stringtie-- I have been running it on some test data and it calls FPKM in places where there are almost no reads at all.

I ran it like this:

stringtie -b stringtie/tx/tmpsZIpEI/Test1 -p 1 -G /v-data/bcbio-nextgen/tests/data/genomes/mm9/rnaseq/ref-transcripts.gtf -e /v-data/bcbio-nextgen/tests/test_automated_output/align/Test1/1_110907_ERP000591_tophat/Test1.bam

which generates this:

18  chrM    +   6942    7011    ENSMUST00000082404  1   70  ENSMUSG00000064353  mt-Td   8911.531250 2380952.250000
19  chrM    +   7013    7696    ENSMUST00000082405  1   684 ENSMUSG00000064354  mt-Co2  911.998718  243664.703125
20  chrM    +   7700    7764    ENSMUST00000082406  1   65  ENSMUSG00000064355  mt-Tk   9597.033203 2564102.500000
21  chrM    +   7766    7969    ENSMUST00000082407  1   204 ENSMUSG00000064356  mt-Atp8 3057.877197 816993.187500
22  chrM    +   7927    8607    ENSMUST00000082408  1   681 ENSMUSG00000064357  mt-Atp6 916.016357  244738.109375
23  chrM    +   8607    9390    ENSMUST00000082409  1   784 ENSMUSG00000064358  mt-Co3  795.672424  212585.046875

ENSMUST00000082404 has a very high FPKM but no reads map there.

IGV screenshot:

[image: igv_snapshot]

It doesn't look like it on IGV, but there are a couple of reads that overhang ENSMUST00000082404 by a couple of bases:

vagrant@vagrant-ubuntu-trusty-64:/v-data/bcbio-nextgen/tests$ samtools view test_automated_output/upload/Test1/Test1-ready.bam chrM:6942-7011
ERR032227.10543296  99  chrM    7008    50  76M =   7233    301 CTTATATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTAATAAATTTCCA    HHHHHHHHHHHHHHHGHHHHFHHFHFHH?HFGGGEDFHEHEBGCBCFEF@EBECCB>>@;;A>??>B?>??=>?B=    MD:Z:76 RG:Z:1  XG:i:0  NH:i:1  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YT:Z:UU
ERR032227.10212620  99  chrM    7010    50  76M =   7263    329 CCTATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTAATAAATTTCCATG    HHHHDHHHHHHHHHGHHGHDHECHEH;GFEHGCCFCCGCCGA?@CEE?<?B@?A>:>=;>??=>?;>><:?;@;;;    MD:Z:0T0A74 RG:Z:1  XG:i:0  NH:i:1  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10    YT:Z:UU
ERR032227.10782111  161 chrM    7010    50  76M =   7280    346 GTTATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTAATAAATTTCCATG    EEEEEEEEEEEEEEEEEEE?EAEEEE>BDEEEBE>=DB=?A>>??=BB<BA@B>A==>8>9A=:A=><;;=<=;<<    MD:Z:0T0A74 RG:Z:1  XG:i:0  NH:i:1  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10    YT:Z:UU
ERR032227.10060210  99  chrM    7011    50  76M =   7257    322 CTATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTAATAAATTTCCATGA    HGHHGFHHHHHHHHHHHHHHFGHHH=HHHHGDBGHFHGGFE@AECF=B@AAAC::D@>?C>???@AC><>=??<=<    MD:Z:0A75   RG:Z:1  XG:i:0  NH:i:1  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 YT:Z:UU
ERR032227.11020015  163 chrM    7011    50  76M =   7191    256 CGATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTAATAAATTTCCATGA    EEEEEEEEEEEEEEEEEEEEBEEEE>EE8EEEE@EBEB@EB?ABAA?>>EA@??>=>:@<>?:;?<<@<===<<88    MD:Z:0A0T74 RG:Z:1  XG:i:0  NH:i:1  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10    YT:Z:UU
ERR032227.11305478  99  chrM    7011    50  76M =   7258    323 CGATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTAATAAATTTCCATGA    HHHHHHHHHHHHHHHHHHHHHHHHHCHHHHHHHGGGHFEGBCFBEAA?=C@?C@=>@>AE?=BAA<@;@B:?=@?5    MD:Z:0A0T74 RG:Z:1  XG:i:0  NH:i:1  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10    YT:Z:UU

The FPKM is so high because this is a really tiny test dataset of about 50k reads.
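
To illustrate the effect of dataset size, here is a small sketch using the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments). This is the generic formula only; StringTie may derive its values differently (for example from per-base coverage), so the numbers are not expected to match the table above. The function name and the fragment counts are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments * 1e9 / (length in bp * total mapped fragments)."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# One fragment on a 70 bp transcript, in a tiny 50k-fragment test dataset ...
small = fpkm(1, 70, 50_000)
# ... versus the same single fragment in a 50M-fragment dataset.
large = fpkm(1, 70, 50_000_000)
print(small, large)
```

With a short transcript and a small library, a single overhanging read is enough to produce a large value.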

Cufflinks and featureCounts both call this as having an expression level of zero, which seems correct to me. Is there something I could tweak to get similar results? I tried setting -g 1 and omitting -e but that didn't make any difference. I tried the brew installed 1.0.3 and also building from master, but had the same behavior.

Here are the Cufflinks FPKM calls for reference:

ENSMUST00000082404  0.0
ENSMUST00000082405  224658.0
ENSMUST00000082406  0.0
ENSMUST00000082407  24456.4
ENSMUST00000082408  234162.0
ENSMUST00000082409  0.0

I can provide all of the test data if that would be helpful. Thanks!

segmentation fault when using annotation file with --merge

Hello,

I downloaded the GTF file from the UCSC hg19 knownGene table and tried to use it with the -G option when merging GTFs. I get a segmentation fault, and the last few lines before stringtie exits are:

[03/10 12:03:30]>bundle chr14:105992953-107283085(273) (38 guides) loaded, begins processing...
Segmentation fault (core dumped)

The merging option works fine without a reference annotation file.

Thanks,

Jenny

minimum read coverage by exons instead of transcript (-c)

I'm writing to ask whether you could add a parameter that filters out exons that do not pass a minimum read coverage threshold. I have an example of a transcript with an extra exon whose coverage is only 3, while all the other exons have at least 80 reads. Also, exons are sometimes too long; maybe an option could be added to limit the extent of exons. This would facilitate the study of differential alternative splicing.

Best regards.

SAM error: found spliced alignment without XS attribute

Hi,

I'm using the transcript GTF assembled by StringTie to run cuffmerge, but I get the following errors:

[14:42:00] Loading reference annotation.
[14:42:03] Inspecting reads and determining fragment length distribution.
SAM error on line 470: found spliced alignment without XS attribute
SAM error on line 471: found spliced alignment without XS attribute
SAM error on line 896: found spliced alignment without XS attribute
SAM error on line 897: found spliced alignment without XS attribute
SAM error on line 1218: found spliced alignment without XS attribute

I use the BAM file produced by HISAT, and it contains some spliced read alignments without the XS tag.

For example:
HWI-7001455:320:HH7WNADXX:1:2110:1668:35148 99 3 9795235 255 21M77N105M = 9795423 314 GTTACTTTCTCTGTCCCCAAGGGTTTTCACTGAATTCTCAGGATTGCGAAGCTCCTCTGCTTCTCTTCCCTTCGGCAAGAAACTTTCTTCCGATGAGTTCGTTTCCATCGTCTCCTTCCAGACTTC ;988(.)9:=4@;.>))<<:((-268=@=1=?63<=?>??)3:>??>?####################################################### AS:i:-15 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:23G102 YS:i:-5 YT:Z:CP

I am not sure whether the SAM file is what caused the cuffmerge error.
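
One way to see how many alignments are affected is to scan the SAM text for records whose CIGAR contains an N operation (a skipped region, i.e. a splice) but which carry no XS:A: tag. The following is a minimal sketch in plain Python, assuming SAM text as produced by `samtools view`; the function name is made up, and the example record is the one quoted above with SEQ and QUAL replaced by "*" and the tag list shortened.

```python
def spliced_without_xs(sam_line):
    """Return True for a spliced alignment (N in CIGAR) that lacks an XS:A: tag."""
    if sam_line.startswith("@"):
        return False  # header line
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return False  # not a complete alignment record
    cigar = fields[5]
    tags = fields[11:]
    has_xs = any(tag.startswith("XS:A:") for tag in tags)
    return "N" in cigar and not has_xs


# Record from the post; SEQ/QUAL shortened to "*" for brevity.
record = "\t".join([
    "HWI-7001455:320:HH7WNADXX:1:2110:1668:35148", "99", "3", "9795235", "255",
    "21M77N105M", "=", "9795423", "314", "*", "*",
    "AS:i:-15", "NM:i:1", "YT:Z:CP",
])
print(spliced_without_xs(record))
```

To apply it to a whole file, pipe `samtools view` output through a loop that calls the function on each line and counts the True results.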

Query on FPKM calculation

Hi Geo,

When I use the -e option in stringtie, are the FPKM calculations done based on reads mapped to the provided GTF or total mapped reads in the BAM file?
I am trying to calculate the FPKM of only protein-coding genes (as a GTF), but I was wondering if the FPKM values also contain reads mapping to non-coding regions.

Srikanth

Stringtie -merge Segmentation fault

I'm trying to run --merge; however, I get a segmentation fault every time. The command is:

nohup /labshare/Nick/tools/stringtie-1.2.0.Linux_x86_64/stringtie --merge /labseq/analysis_results/projects/"$PROJDIR"/"$PROJNAME"/gtf.list -G /labseq/Genomes/rsem/hg19.gtf -o /labseq/analysis_results/projects/"$PROJDIR"/"$PROJNAME"/"$PROJNAME".run1.annotation.gtf -v > /labseq/analysis_results/projects/"$PROJDIR"/"$PROJNAME"/log/"$PROJDIR"."$PROJNAME".merge.log &

Please let me know what info will help you help me in this process.

The first run (run1) went perfectly smoothly, and the merge works when I use a small subset (6 samples) of my cohort (311 samples).

The samples were aligned with 2-pass STAR alignment. Any help would be much appreciated.

Best

super-reads

Hello,
Running the superreads.pl script produces many files in my StringTie directory. Does LongReads.fq.gz contain the super-reads, and do 1_1.notAssembled.fq.gz and 1_2.notAssembled.fq.gz contain the unassembled paired reads?

[image: LongReads.fq.gz]

[image: longreads]

Why can't I use gunzip to decompress the LongReads.fq.gz file produced by the superreads.pl script?
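
A quick first check, offered only as a guess at the cause, is whether the file really contains gzip data despite its .gz name. Gzip files start with the two bytes 0x1f 0x8b, so a minimal sketch can test for that; the helper name looks_like_gzip is made up, and the two temporary files below stand in for real output files.

```python
import gzip
import os
import tempfile


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


# Demonstration on two throwaway files: one really gzipped, one plain text.
workdir = tempfile.mkdtemp()
real_gz = os.path.join(workdir, "real.fq.gz")
fake_gz = os.path.join(workdir, "fake.fq.gz")
with gzip.open(real_gz, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")
with open(fake_gz, "w") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")

print(looks_like_gzip(real_gz), looks_like_gzip(fake_gz))
```

If the check fails on LongReads.fq.gz, the file may be plain FASTQ with a misleading extension; if it passes and gunzip still fails, the file may be truncated.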

Runtime Error: Error at GBitVec

After your suggestion on Biostars regarding the Runtime Error:

'Error at GBitVec: index 7 out of bounds (size 7)'

I updated StringTie to version 1.1.1, but the error keeps happening. After using the parameter -v the following lines are the last before StringTie crashes:

[11/08 14:57:59]>bundle 6:1383650-1383656(1) (0 guides) loaded, begins processing...
[11/08 14:57:59]Number of fragments in bundle: 1 with sum 7
^bundle 6:1383650-1383656(1) done (0 processed potential transcripts).
[11/08 14:57:59]>bundle 6:1383790-1389117(31) (1 guides) loaded, begins processing...
Error at GBitVec: index 7 out of bounds (size 7)

Maybe I misunderstand the -c parameter: I increased it from the default to 10, but the error persists.

GFF3 parsing bug: parentless CDS/exon feature not recognized as transcript

I've been using stringtie successfully for a while now. I just updated to the newest version, but now I'm getting the following error:

Error: could not read reference annotation transcripts from GTF/GFF xam668_prokka.exon.gff3 - invalid file?

I have run an older version (which version I'm not sure at the moment) of stringtie successfully with this file just the other day, however the update seems to be checking something that I can't for the life of me figure out. Here are a few of the various versions of my gff file I have tried, the first one is the original and the one that worked with an older version of stringtie:

Xam668_contig195        Prodigal:2.6    CDS     700     1398    .       -       0       ID=xam668_04238;gene=xam668_04238;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:RefSeq:WP_007974205.1;locus_tag=xam668_04238;product=partition protein

Xam668_contig195        Prodigal:2.6    exon    700     1398    .       -       0       ID=xam668_04238;gene_name=xam668_04238;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:RefSeq:WP_007974205.1;locus_tag=xam668_04238;product=partition protein

Xam668_contig195        Prodigal:2.6    mRNA    700     1398    .       -       0       ID=xam668_04238;gene=xam668_04238;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:RefSeq:WP_007974205.1;locus_tag=xam668_04238;product=partition protein

Xam668_contig195        Prodigal:2.6    CDS     700     1398    .       -       0       ID=xam668_04238;gene_name=xam668_04238;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:RefSeq:WP_007974205.1;locus_tag=xam668_04238;product=partition protein

Xam668_contig195        Prodigal:2.6    CDS     700     1398    .       -       0       ID=xam668_04238;gene=xam668_04238;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:RefSeq:WP_007974205.1;locus_tag=xam668_04238;product=partition protein

Xam668_contig195        Prodigal:2.6    CDS     700     1398    .       -       0       ID=xam668_04238;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:RefSeq:WP_007974205.1;locus_tag=xam668_04238;product=partition protein

Xam668_contig195        Prodigal:2.6    gene    700     1398    .       -       0       ID=xam668_04238;Name=xam668_04238

Based on my experience with a number of programs that process GFF files, including stringtie and cufflinks, each of these files follows the standard. Any help indicating what I should change to bring this file in line with stringtie's new requirements would be great.

strand information is missed in GTF output

My output GTF from stringtie contains 3630 gene models with "." as the strand (i.e. unstranded). So I followed @mpertea's previous suggestion and checked my BAM file. I found that the reads aligned to these gene models do have strand information, either "+" or "-", but I still get unstranded gene models. Could anyone shed light on this?

As I understand it, once a read aligns to the genome it must have strand information, either + or -. Are there any cases in which a read has no strand?

Enormous memory consumption with large BAM file

Hi,
I'm trying to launch a stringTie assembly on a rather large BAM file (81G), and the call systematically ends with an 'error allocating memory'. I monitored it and saw that it used up to 128 GB of ram (it keeps increasing, and decreases only by tiny amounts from time to time). I tried reducing the number of threads, and using the default -s 1000000, but I get into the same problem. This is puzzling since loci are supposed to be processed one by one and I cannot understand how loading a maximum of 1000000 read pairs, even multithreading, could ever use so much memory. Is this somehow expected?
The reads have been aligned to a spiked-in human genome using HISAT, and I'm using stringTie -G with the refseq annotation. (The fact that it's spiked-in means that technically there are a bit more than 100 chromosomes, if that's of any relevance.)
(I've tried compiling and running the v1.0.4 software on two different linux x86 clusters.)
Thanks in advance,
Pierre-Luc

--help returns exit code 1

It's tough to write an installer that checks to see if stringtie is installed correctly because there isn't a way to check without returning a non-zero exit code. Do you think --help could be swapped to return 0?
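
In the meantime, an installer can work around this by ignoring the exit code and checking only that the executable can be found, launched, and prints something. Below is a minimal Python sketch of that idea; it assumes, as described above, that `stringtie --help` prints its usage text while returning a non-zero code. The helper name can_run is made up, and the demonstration runs the Python interpreter itself so that it works on machines without stringtie.

```python
import shutil
import subprocess
import sys


def can_run(executable, args=("--help",)):
    """Return True if the executable is found, launches, and prints any output.

    The exit code is deliberately ignored.
    """
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, *args],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return bool(result.stdout or result.stderr)


# Demonstration with the current Python interpreter instead of stringtie.
print(can_run(sys.executable, ("--version",)))
print(can_run("no-such-binary-xyz-12345"))
```

For stringtie the call would be `can_run("stringtie")`. Note that this check cannot tell a usage message from a crash message, so it only confirms that the binary starts.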

strand information is missed in GTF output

I notice that some transcript/exon rows in the GTF output have no strand information; there is only a dot in column 7. Because of this, the file is rejected by RSEM. Is this caused by an uncertain strand, or is it a bug?
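
As a stopgap for downstream tools that reject such rows, the GTF can be split into stranded and unstranded records before use. The sketch below does this in plain Python, assuming a standard tab-separated GTF with the strand in column 7; the function name and the two example records are hypothetical. Dropping the unstranded records changes the annotation, so whether that is acceptable depends on the analysis.

```python
def split_by_strand(gtf_lines):
    """Separate GTF records into those with a +/- strand and those with '.'."""
    stranded, unstranded = [], []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        if fields[6] in ("+", "-"):
            stranded.append(line)
        else:
            unstranded.append(line)
    return stranded, unstranded


# Hypothetical records for illustration only.
example = [
    "# example GTF\n",
    'chr1\tStringTie\ttranscript\t100\t500\t1000\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n',
    'chr1\tStringTie\ttranscript\t900\t1200\t1000\t.\t.\tgene_id "G2"; transcript_id "G2.1";\n',
]
stranded, unstranded = split_by_strand(example)
print(len(stranded), len(unstranded))
```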

Single exon genes

Whenever I run cuffmerge after StringTie, it gives me "GFF Error: duplicate/invalid 'mRNA' feature" for a number of loci.

The only thing they seem to have in common is that they have only one exon per gene.

Is this a known bug?

Feature missed CDS, start_codon and stop_codon

The feature column of my output GTF file from stringtie contains only "exon" and "transcript", with no "CDS", "start_codon" or "stop_codon". Is that normal? Has anyone else seen this?

(-x) exclude regions based on given BED file

It would be good to be able to extend the -x option to not just exclude whole chromosomes/contigs from stringtie processing, but also genome regions, and using a BED file would be a natural way to provide that list of regions to exclude.

Error: could not execute cuffcompare

Hi,
When I ran cuffmerge on the GTF files produced by stringtie, I got some errors. Can anyone help me?
This is my command:
cuffmerge -g gene.gtf -s genome.fa -o stringtie_cuffmerge assemblies.txt
cuffmerge error:
Error (GFaSeqGet): end coordinate (65894153) cannot be larger than sequence length 65894135
Error (GFaSeqGet): end coordinate (65894153) cannot be larger than sequence length 65894135
Error (GFaSeqGet): end coordinate (65894153) cannot be larger than sequence length 65894135
Error (GFaSeqGet): end coordinate (65894153) cannot be larger than sequence length 65894135
Error (GFaSeqGet): end coordinate (511) cannot be larger than sequence length 502
Error (GFaSeqGet): end coordinate (515) cannot be larger than sequence length 498
Error (GFaSeqGet): end coordinate (495) cannot be larger than sequence length 490
Error (GFaSeqGet): subsequence cannot be larger than 484
Error getting subseq for CUFF.60225.1 (1..502)!
[FAILED]
Error: could not execute cuffcompare
Thanks.
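
The GFaSeqGet messages suggest that some features in the GTF end beyond the end of the corresponding sequence in genome.fa, which could happen if the FASTA differs from the genome used for alignment; that is only a guess. A minimal Python sketch for finding such features is shown below. It assumes a FASTA index in the `samtools faidx` .fai layout (sequence name in column 1, length in column 2); the function names and the sequence name chrA are made up, while the two coordinates are taken from the first error message above.

```python
def read_fai(fai_lines):
    """Map sequence name to length from .fai index lines."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def features_past_end(gtf_lines, lengths):
    """List (seqid, feature_end, seq_length) for features ending past the sequence."""
    problems = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        seqid, end = fields[0], int(fields[4])
        if seqid in lengths and end > lengths[seqid]:
            problems.append((seqid, end, lengths[seqid]))
    return problems


# Hypothetical sequence name; coordinates from the error message in the post.
fai = ["chrA\t65894135\t6\t60\t61\n"]
gtf = [
    'chrA\tStringTie\texon\t65893000\t65894153\t1000\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n',
    'chrA\tStringTie\texon\t100\t500\t1000\t+\t.\tgene_id "G2"; transcript_id "G2.1";\n',
]
problems = features_past_end(gtf, read_fai(fai))
print(problems)
```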

Exons not being trimmed in ssRNAseq data, combining adjacent genes

Hello,

I've been playing with StringTie for the past few months and I've come to believe that - with the strand-specific RNAseq data that I have, at least - the trimming of exons at drops in coverage is not occurring.

I observe this behaviour for both the first/last exon/UTRs and also internal exons; the example I've attached below shows the gene on the right having very good coverage over the last exon/UTR and a dropoff in coverage where the RefSeq reference transcript ends; however, StringTie additionally assembles the reads between the two genes into one long exon, and so ends up stitching together two transcripts into one artefactual one.

[image: 20150812_igv_02_no_trimming]

This particular assembly was run with minimum isoform fraction of 0.2 - if I lower this to 0.1 or 0.05 there are many, many more untrimmed exons (presumably because they are not filtered out because they pass this minimum isoform threshold).

Cheers!

Error at GBitVec: index 25 out of bounds

When I used stringtie v 1.2.3, I got:

Error at GBitVec: index 25 out of bounds (size 25)
*** glibc detected *** stringtie: double free or corruption (!prev): 0x0000000001caccb0 ***

The input is from HISAT2 (--dta).

stringtie mpout/Gallus_gallus/GSM752558/hisat2_sorted.bam -p 4 -G $tdir/${spp}.gtf -l $spp -o stoutENS/Gallus_gallus/GSM752558/stringtie.gtf -A stoutENS/Gallus_gallus/GSM752558/gene_abund.tab -B

problem when convert gtf to gff

When I tried to convert the .gtf to .gff, I encountered the following error message.
Could you help me to deal with this? Thank you.

cufflinks2gff3 strtie_merge.gtf > strtie_merge.gff

Use of uninitialized value $score in join or string at /home/pyh/bin/maker/bin/cufflinks2gff3 line 94, line 221531.
Use of uninitialized value $score in join or string at /home/pyh/bin/maker/bin/cufflinks2gff3 line 94, line 221532.
Use of uninitialized value $score in join or string at /home/pyh/bin/maker/bin/cufflinks2gff3 line 94, line 221533.
Use of uninitialized value $score in join or string at /home/pyh/bin/maker/bin/cufflinks2gff3 line 94, line 221534.
Use of uninitialized value $score in join or string at /home/pyh/bin/maker/bin/cufflinks2gff3 line 94, line 221535.
Use of uninitialized value $score in join or string at /home/pyh/bin/maker/bin/cufflinks2gff3 line 94, line 221536.
