magdoll / sqanti2

SQANTI2 has been replaced by SQANTI3. Please go to: https://github.com/ConesaLab/SQANTI3

License: Other

Python 59.32% R 26.78% Perl 13.90%

sqanti2's Issues

subprocess.CalledProcessError in PERL

Hi Elizabeth,

I hit a subprocess.CalledProcessError with SQANTI2 (v2.8) at the "Predicting ORF" step. Strangely, it happens with only one of my samples; the other samples produce classification results.

Here is the log:

Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /DATA/test_OC_1_2/collapse/OC_1_2.collapsed.rep.renamed.fasta
Error corrected FASTA /DATA/test_OC_1_2/annotation/OC_1_2.collapsed.rep.renamed_corrected.fasta already exists. Using it...
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LANG = "en_US.utf8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Use of uninitialized value in open at /DATA/software/SQANTI2/utilities/gmst/gmst.pl line 885, <$FA> line 94.
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /DATA/database/mm10/reference/mm10.sort.fa....
**** Predicting ORF sequences...
Traceback (most recent call last):
  File "/DATA/software/SQANTI2/sqanti_qc2.py", line 1599, in <module>
    main()
  File "/DATA/software/SQANTI2/sqanti_qc2.py", line 1595, in main
    run(args)
  File "/DATA/software/SQANTI2/sqanti_qc2.py", line 1248, in run
    orfDict = correctionPlusORFpred(args, genome_dict)
  File "/DATA/software/SQANTI2/sqanti_qc2.py", line 480, in correctionPlusORFpred
    if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
  File "/DATA/anaconda2/envs/rna/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /DATA/software/SQANTI2/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /DATA/test_OC_1_2/annotation/GMST/GMST_tmp /DATA/test_OC_1_2/annotation/OC_1_2.collapsed.rep.renamed_corrected.fasta' returned non-zero exit status 2

Thanks,
Y.Zhang

sqanti_filter2.py is not working

Hello,

sqanti_qc2.py works, but sqanti_filter2.py fails as shown below. Is this a problem with the Python script itself? I would appreciate any help fixing it. Thank you!

Taehee

$ sqanti_filter2.py
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 1: __author__: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 2: __version__: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 12:
Lightweight filtering of SQANTI by using .classification.txt output

Only keep Iso-Seq isoforms if:
The isoform is FSM, ISM, or NIC and (does not have intrapriming or has polyA_motif)
The isoform is NNC, does not have intrapriming/or polyA motif, not RT-switching, and all junctions are either all canonical or short-read-supported
The isoform is antisense, intergenic, genic, does not have intrapriming/or polyA motif, not RT-switching, and all junctions are either all canonical or short-read-supported

: No such file or directory
import: unable to open X server `' @ error/import.c/ImportImageCommand/369.
import: unable to open X server `' @ error/import.c/ImportImageCommand/369.
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 16: from: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 17: from: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 18: from: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 19: from: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 21: syntax error near unexpected token `('
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 21: `utilitiesPath = os.path.dirname(os.path.realpath(__file__))+"/utilities/"'
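
The "command not found" messages for Python statements suggest that the file was interpreted by the shell rather than by Python, which typically happens when a script is executed directly and its first line is not a `#!` interpreter line. A minimal sketch for checking this follows; `has_python_shebang` and the two temporary demo files are hypothetical and not part of SQANTI2.

```python
import os
import tempfile


def has_python_shebang(path):
    """Return True if the first line of `path` is a '#!' line naming python."""
    with open(path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace")
    return first_line.startswith("#!") and "python" in first_line


def _write_temp(text):
    # Demo helper: write `text` to a temporary .py file and return its path.
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path


with_shebang = _write_temp("#!/usr/bin/env python\nprint('ok')\n")
without_shebang = _write_temp("__author__ = 'someone'\nprint('ok')\n")

print(has_python_shebang(with_shebang))
print(has_python_shebang(without_shebang))
```

If the check fails for the installed script, invoking it explicitly through the interpreter (`python /path/to/sqanti_filter2.py`) is the usual workaround.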

Error in sqanti_qc2.py

Dear @Magdoll

Hello, I ran your updated version of SQANTI2 with the command below:

python /appl/sqanti2_2/SQANTI2/sqanti_qc2.py -t 30 -c illumina/PM-AU-0002-N-A1SJ.out.tab 20180817_colon_2N_Nanoflit_q7_pychopper_2.fasta /data/ONT_RNA/reference/Homo_sapiens.GRCh38.93.gtf /data/ONT_RNA/reference/hg38.fa

and the script stopped with the following error:

Error in `$<-.data.frame`(`*tmp*`, SJ_type, value = "__SJ") : 
  replacement has 1 row, data has 0
Calls: $<- -> $<-.data.frame
Execution halted
Traceback (most recent call last):
  File "/appl/sqanti2_2/SQANTI2/sqanti_qc2.py", line 1515, in <module>
    main()
  File "/appl/sqanti2_2/SQANTI2/sqanti_qc2.py", line 1511, in main
    run(args)
  File "/appl/sqanti2_2/SQANTI2/sqanti_qc2.py", line 1346, in run
    if subprocess.check_call(cmd, shell=True)!=0:
  File "/usr/local/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)

The script ran fine without the -c (Illumina SJ file) option, but with it I get the error above.

Thank you very much for your help!

Jungwoo
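
The R message says a column was assigned to a data frame with zero rows. One thing worth ruling out (an assumption, not a confirmed diagnosis of this report) is that the junction file and the genome FASTA use different chromosome names, for example `1` versus `chr1`. The sketch below compares the two name sets; the helper names and the tiny in-memory inputs are made up for illustration.

```python
import io


def fasta_names(handle):
    """Collect sequence names (first word of each '>' header line)."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def first_column_names(handle):
    """Collect the values of the first tab-separated column."""
    names = set()
    for line in handle:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t")[0].strip())
    return names


# Illustrative inputs only: a genome with 'chr'-prefixed names and a
# junction table without the prefix.
genome = io.StringIO(">chr1 description\nACGT\n>chr2\nACGT\n")
junctions = io.StringIO("1\t100\t200\t1\t1\t1\t5\t0\t30\n")

shared = fasta_names(genome) & first_column_names(junctions)
print(sorted(shared))
```

An empty intersection would mean no junction can be matched to a genome sequence by name.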

Could I use BAM as input rather than FASTQ?

I know the current version of SQANTI2 does not support BAM as input; it takes FASTQ files and runs the alignment itself. If I want to do post-alignment filtering before SQANTI2, I need to 1) align the FLNC reads on my own, 2) do the post-alignment filtering, 3) convert the BAM to FASTQ, and 4) run SQANTI2 on the filtered FASTQ, which performs an additional round of alignment.
This matters more in my case because I use FLNC reads rather than clustered reads, and FLNC reads need more post-processing. For my research, the number of PacBio reads matters for assessing isoform expression. Moreover, the clustering procedure discards singleton transcripts, which are evidence of expression of rare isoforms.
Would it be possible to use BAM as input for SQANTI2?
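
Step 3 of the workaround above (alignments back to FASTQ) can be sketched on SAM text as follows. This is an illustration under stated assumptions, not SQANTI2 functionality: it works on SAM lines rather than binary BAM, and `sam_line_to_fastq` is a hypothetical helper. In practice `samtools fastq` performs this conversion.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq(line):
    """Convert one SAM alignment line to a 4-line FASTQ record, or None."""
    if line.startswith("@"):
        return None  # header line
    fields = line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x900:
        return None  # skip secondary (0x100) and supplementary (0x800) records
    if flag & 0x10:
        # Reverse-strand alignments store the reverse complement of the read,
        # so undo that to recover the original read orientation.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return "@{}\n{}\n+\n{}\n".format(name, seq, qual)


record = sam_line_to_fastq("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIHG\n")
print(record, end="")
```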

error in sqanti_qc2.py

Hi:

[luping@centos split-F-gene]$ python /disk/luping/tools/SQANTI2-master/sqanti_qc2.py -t 15 -g ISO-fusion.collapsed.gtf Fusarium_graminearum.RR1.41.chr.gtf ph1.fasta
R scripting front-end version 3.5.1 (2018-07-02)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /disk/luping/bam/iso-bam/pb-cluster-hqlq-total/two-condition/split-F-gene/ISO-fusion.collapsed.renamed.fasta
Traceback (most recent call last):
  File "/disk/luping/tools/SQANTI2-master/sqanti_qc2.py", line 1395, in <module>
    main()
  File "/disk/luping/tools/SQANTI2-master/sqanti_qc2.py", line 1384, in main
    if args.aligner_choice == "minimap2":
AttributeError: 'Namespace' object has no attribute 'aligner_choice'
[luping@centos split-F-gene]$ minimap2 --version
2.14-r894-dirty

What's wrong with this?
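
The traceback shows the code reading `args.aligner_choice` from an argparse `Namespace` that has no such attribute. The sketch below reproduces that situation with a hypothetical parser (not SQANTI2's actual argument definitions) and shows a defensive lookup with `getattr`.

```python
import argparse

# Hypothetical parser: it registers -g but never registers --aligner_choice.
parser = argparse.ArgumentParser()
parser.add_argument("-g", "--gtf", action="store_true")
args = parser.parse_args(["-g"])

# Direct attribute access fails in the same way as in the traceback.
try:
    args.aligner_choice
    raised = False
except AttributeError:
    raised = True

# A defensive lookup returns a default instead of raising.
aligner = getattr(args, "aligner_choice", None)
print(raised, aligner)
```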

Possibility for Nanopore data

Hi,

This pipeline looks great, and I would like to run it on my data, which comes from Nanopore direct RNA sequencing. Do you think this is possible? I tried running it with a fasta file but got an error about the read ID.

Thank you for your help.

AssertionError on sqanti_qc2.py v4.0 and cDNA_Cupcake v8.5

Hi Elizabeth,

I am running sqanti_qc2.py version 4.0 on your example data as a test, with this command:
python sqanti_qc2.py -t 6 touse.rep.fasta gencode.v31.annotation.gtf hg38.genome.fa --cage_peak hg38.cage_peak_phase1and2combined_coord.bed

In my STDERR, I get this error:

R scripting front-end version 3.4.4 (2018-03-15)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /home/hcd_lab/software/SQANTI2-master/example/touse.rep.renamed.fasta
[M::mm_idx_gen::42.906*1.75] collected minimizers
[M::mm_idx_gen::54.057*2.60] sorted minimizers
[M::main::54.057*2.60] loaded/built the index for 25 target sequence(s)
[M::mm_mapopt_update::56.513*2.53] mid_occ = 763
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 25
[M::mm_idx_stat::57.933*2.49] distinct minimizers: 167178949 (35.44% are singletons); average occurrences: 6.015; average spacing: 3.071
[M::worker_pipeline::65.164*2.88] mapped 4730 sequences
[M::main] Version: 2.17-r954-dirty
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 6 /media/huyueming/disk1/data/hg38/hg38.genome.fa /home/hcd_lab/software/SQANTI2-master/example/touse.rep.renamed.fasta
[M::main] Real time: 65.291 sec; CPU: 187.598 sec; Peak RSS: 18.696 GB
output written to /home/hcd_lab/software/SQANTI2-master/example/touse.rep.renamed_corrected.fasta
**** Parsing Isoforms....
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /media/huyueming/disk1/data/hg38/hg38.genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
**** Parsing Reference Transcriptome....
Splice Junction Coverage files not provided.
**** Reading CAGE Peak data.
**** Performing Classification of Isoforms....
Number of classified isoforms: 4730
Traceback (most recent call last):
  File "/home/hcd_lab/software/SQANTI2-master/sqanti_qc2.py", line 1762, in <module>
    main()
  File "/home/hcd_lab/software/SQANTI2-master/sqanti_qc2.py", line 1758, in main
    run(args)
  File "/home/hcd_lab/software/SQANTI2-master/sqanti_qc2.py", line 1405, in run
    write_collapsed_GFF_with_CDS(isoforms_info, corrGTF, corrGTF+'.cds.gff')
  File "/home/hcd_lab/software/SQANTI2-master/sqanti_qc2.py", line 399, in write_collapsed_GFF_with_CDS
    for r in reader:
  File "/home/hcd_lab/software/anaconda2/lib/python2.7/site-packages/cupcake-8.5-py2.7-linux-x86_64.egg/cupcake/io/GFF.py", line 393, in next
    return self.read()
  File "/home/hcd_lab/software/anaconda2/lib/python2.7/site-packages/cupcake-8.5-py2.7-linux-x86_64.egg/cupcake/io/GFF.py", line 550, in read
    assert raw[2] == 'transcript'
AssertionError
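
The failing assertion shows that the reader expected `transcript` in the third column of the line it was parsing. A quick way to inspect a GFF/GTF file is to look at the first feature type and the feature-type counts, as sketched below. The helper name and the two-line sample are hypothetical; the sketch only describes the file and does not explain why the file ended up in that shape.

```python
import collections
import io


def feature_type_summary(handle):
    """Return (first feature type, Counter of feature types) for a GFF/GTF."""
    counts = collections.Counter()
    first = None
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a 9-column feature line
        if first is None:
            first = fields[2]
        counts[fields[2]] += 1
    return first, counts


sample = io.StringIO(
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\ttranscript_id \"t1\";\n"
    "chr1\tsrc\texon\t300\t400\t.\t+\t.\ttranscript_id \"t1\";\n"
)
first, counts = feature_type_summary(sample)
print(first, dict(counts))
```

A file whose records do not begin with a `transcript` line, like this sample, would trip an assertion of the kind shown in the traceback.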

Intergenic classification

Hi @Magdoll,

I have run my Iso-Seq data through SQANTI and I noticed that some of the transcripts that are classified as "intergenic" are multi-exon transcripts that overlap a mono-exon transcript in the reference annotation. Based on the code, I can see why this happens since multi-exon transcripts are only compared to multi-exon transcripts from the reference. I'm not sure that classifying these transcripts as "intergenic" is logical though. Perhaps they would better fit under NNC? Or an additional label?

[v5.1.0] Error: invalid feature coordinates (end<start!)

Hello @Magdoll ,

I've tried the newest version of SQANTI2 (v5.1.0) and saw the following log:

[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax splice -uf -k14 -C5 --secondary=no -t 50 hg38_chr.fa file1.fasta
[M::main] Real time: 1452.274 sec; CPU: 68896.896 sec; Peak RSS: 54.618 GB
Error: invalid feature coordinates (end<start!) at line:
chr5 hg38_chr exon 40824922 40824921 . - . ID=94a3dcb5-de7c-4d08-b571-e2d0c5533afa_dup2.exon2;Name=94a3dcb5-de7c-4d08-b571-e2d0c5533afa_dup2.exon2;Parent=94a3dcb5-de7c-4d08-b571-e2d0c5533afa_dup2
Error: invalid feature coordinates (end<start!) at line:
chr2 hg38_chr exon 88935909 88935908 . - . ID=65ccca1d-b8ee-40ef-9b7a-2a617adf3cfe.exon2;Name=65ccca1d-b8ee-40ef-9b7a-2a617adf3cfe.exon2;Parent=65ccca1d-b8ee-40ef-9b7a-2a617adf3cfe

I modified the minimap2 options for Nanopore direct RNA but did not change anything else. Only 4 lines say "invalid feature coordinates", but it bothers me nonetheless. Could you help me understand what's going on? Thanks!
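
To locate every such record rather than rely on the first few messages, a GFF file can be scanned for features whose end coordinate is smaller than the start. This is a minimal sketch with a hypothetical helper name; it assumes tab-separated GFF lines, and the sample reuses the coordinates from the log above with shortened attributes.

```python
import io


def inverted_features(handle):
    """Return (line number, seqid, start, end) for GFF lines with end < start."""
    bad = []
    for number, line in enumerate(handle, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        start, end = int(fields[3]), int(fields[4])
        if end < start:
            bad.append((number, fields[0], start, end))
    return bad


sample = io.StringIO(
    "chr5\thg38_chr\texon\t40824800\t40824900\t.\t-\t.\tID=a.exon1\n"
    "chr5\thg38_chr\texon\t40824922\t40824921\t.\t-\t.\tID=a.exon2\n"
)
bad_lines = inverted_features(sample)
print(bad_lines)
```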

SQANTI2 fails at ORF finding

Posting the original issue here first....

I've mapped some isoseq data and then collapsed it with TAMA and am now trying to do the final processing in SQANTI2 - I can't get past this error:

(/opt/sqanti2/4.1/anaCogent3) [rwr002@node35 ISOSEQ_gmap_tama]$ python /opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py --aligner_choice gmap -x genomes/GRCh38 -t 18 -o FAM43 -d TESTOUT fam43_isoseq_out.fastq.split.tama.renamed.fasta genomes/gencode.v32.annotation.gtf genomes/hg38.fa

R scripting front-end version 3.2.3 (2015-12-10)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/fam43_isoseq_out.fastq.split.tama.renamed.renamed.fasta
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/genomes/hg38.fa....
Aligned SAM /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/TESTOUT/fam43_isoseq_out.fastq.split.tama.renamed.renamed_corrected.sam already exists. Using it...
output written to /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/TESTOUT/fam43_isoseq_out.fastq.split.tama.renamed.renamed_corrected.fasta
**** Predicting ORF sequences...
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
GeneMarkS: error on last system call, error code 134
Abort program!!!
Traceback (most recent call last):
  File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1793, in <module>
    main()
  File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1789, in main
    run(args)
  File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1414, in run
    orfDict = correctionPlusORFpred(args, genome_dict)
  File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 555, in correctionPlusORFpred
    if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
  File "/opt/sqanti2/4.1/anaCogent3/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /gpfs0/export/opt/sqanti2/4.1/SQANTI2-4.1/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/TESTOUT/GMST/GMST_tmp /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/TESTOUT/fam43_isoseq_out.fastq.split.tama.renamed.renamed_corrected.fasta' returned non-zero exit status 1

Half of the HQ isoforms are classified as antisense and NIC

I ran SQANTI2 on PacBio Sequel high-quality collapsed isoforms from Iso-Seq + Tofu. More than 50% of the isoforms are classified as "antisense" or "NIC". I used GMAP to map against the Ensembl mouse genome.

I also tried proovread to correct the high-quality isoforms with short reads. In the SQANTI2 report, the number of "NIC" dropped and "ISM" increased, but "antisense" is still around 25%.

Is it normal to see half of the isoforms classified as "antisense" and "NIC"? Or is it due to insufficient coverage?

Thanks!

Existing files from "incomplete" runs cause errors on subsequent runs

I'm in the middle of prepping a script to perform this analysis, and I'd forgotten to enable the gmap module on my system - so sqanti kind of ran but didn't. Subsequently, it continued to fail until I deleted the files that were created during the aborted attempt - apparently it doesn't like having pre-existing files:

**** Predicting ORF sequences...
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
GeneMarkS: error on last system call, error code 134
Abort program!!!
Traceback (most recent call last):
  File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1793, in <module>
    main()
  File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1789, in main
    run(args)
  File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1414, in run
    orfDict = correctionPlusORFpred(args, genome_dict)
  File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 555, in correctionPlusORFpred
    if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
  File "/opt/sqanti2/4.1/anaCogent3/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /gpfs0/export/opt/sqanti2/4.1/SQANTI2-4.1/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_2/GMST/GMST_tmp /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_2/cupcake.gmap.collapsed.rep.renamed_corrected.fasta' returned non-zero exit status 1
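
Since the logs in these reports show intermediate files being reused ("already exists. Using it..."), listing leftovers from an aborted run before rerunning can help. The sketch below only lists candidates and deletes nothing. The patterns are assumptions inferred from the file names in the logs above (`*_corrected.fasta`, `*_corrected.sam`, a `GMST` directory), not an official list, and `find_stale_outputs` is a hypothetical helper.

```python
import glob
import os
import tempfile


def find_stale_outputs(out_dir):
    """List intermediate outputs in `out_dir` that a rerun might silently reuse."""
    patterns = ["*_corrected.fasta", "*_corrected.sam", "GMST"]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(out_dir, pattern)))
    return sorted(found)


# Demo on a throwaway directory with made-up file names.
demo_dir = tempfile.mkdtemp()
for name in ("x.renamed_corrected.fasta", "x.renamed_corrected.sam", "x.renamed.fasta"):
    open(os.path.join(demo_dir, name), "w").close()
os.mkdir(os.path.join(demo_dir, "GMST"))

stale = [os.path.basename(path) for path in find_stale_outputs(demo_dir)]
print(stale)
```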

TypeError on v3.2

Hi Magdoll,

I ran your updated version of SQANTI2 (v3.2) with the command below:

python ../../../SQANTI2/sqanti_qc2.py \
-t 50 \
test.fasta \
../../../references/gencode.v31.annotation.gtf.gz \
../../../references/hg38_noALT.fa

and encountered the following error:

**** Performing Classification of Isoforms....
Traceback (most recent call last):
  File "../../../SQANTI2/sqanti_qc2.py", line 1677, in <module>
    main()
  File "../../../SQANTI2/sqanti_qc2.py", line 1673, in main
    run(args)
  File "../../../SQANTI2/sqanti_qc2.py", line 1325, in run
    isoforms_info = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict)
  File "../../../SQANTI2/sqanti_qc2.py", line 1225, in isoformClassification
    dist_to_last_junc = rec.junctions[0][1] - orfDict[rec.id].cds_genomic_end
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

Do you have any suggestions to fix the error?
The same command worked fine previously on v2.8.
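
The TypeError means `cds_genomic_end` was `None` for the isoform being processed when it was subtracted from a junction coordinate. A guarded version of that subtraction could look like the sketch below. `OrfRecord` and `distance_to_cds_end` are hypothetical stand-ins, not SQANTI2's real classes or its actual fix.

```python
class OrfRecord:
    """Hypothetical stand-in for an ORF prediction; the CDS end may be unknown."""

    def __init__(self, cds_genomic_end=None):
        self.cds_genomic_end = cds_genomic_end


def distance_to_cds_end(junction_pos, orf):
    """Return junction_pos - CDS end, or None when the CDS end is unknown."""
    if orf is None or orf.cds_genomic_end is None:
        return None
    return junction_pos - orf.cds_genomic_end


print(distance_to_cds_end(1000, OrfRecord(900)))
print(distance_to_cds_end(1000, OrfRecord()))
```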

Errors in running sqanti2

Hi @Magdoll,
I ran into some problems when running SQANTI2 and need your help to figure them out.
First, when running sqanti_qc2.py, I got the error message below:
R scripting front-end version 3.5.2 (2018-12-20)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /data/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/fasta/ccs.NP0328-157_3078_RTEL1_lbc65.lbc65--lbc65.renamed.fasta
[M::mm_idx_gen::123.330*1.73] collected minimizers
[M::mm_idx_gen::149.095*2.55] sorted minimizers
[M::main::149.095*2.55] loaded/built the index for 195 target sequence(s)
[M::mm_mapopt_update::154.771*2.49] mid_occ = 751
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 195
[M::mm_idx_stat::157.585*2.47] distinct minimizers: 167240184 (35.46% are singletons); average occurrences: 6.007; average spacing: 3.086
[M::worker_pipeline::223.786*4.05] mapped 15901 sequences
[M::main] Version: 2.15-r905
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 8 /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa /data/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/fasta/ccs.NP0328-157_3078_RTEL1_lbc65.lbc65--lbc65.renamed.fasta
[M::main] Real time: 224.164 sec; CPU: 906.497 sec; Peak RSS: 18.434 GB
output written to /gpfs/gsfs8/users/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/ccs.NP0328-157_3078_RTEL1_lbc65.lbc65--lbc65.renamed_corrected.fasta
Skipping m54137_181210_213828/5308570/ccs because unmapped.
...
Skipping m54137_181210_213828/56033863/ccs because unmapped.
error in command line
GeneMarkS: error on last system call, error code 256
Abort program!!!
Aligner choice: Minimap2.
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
Traceback (most recent call last):
  File "/data/daij/SQANTI2/sqanti_qc2.py", line 1395, in <module>
    main()
  File "/data/daij/SQANTI2/sqanti_qc2.py", line 1391, in main
    run(args)
  File "/data/daij/SQANTI2/sqanti_qc2.py", line 1081, in run
    orfDict = correctionPlusORFpred(args, genome_dict)
  File "/data/daij/SQANTI2/sqanti_qc2.py", line 412, in correctionPlusORFpred
    if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
  File "/data/daij/miniconda2/envs/anaCogent5.2/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /gpfs/gsfs8/users/daij/SQANTI2/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /gpfs/gsfs8/users/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/GMST/GMST_tmp /gpfs/gsfs8/users/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/ccs.NP0328-157_3078_RTEL1_lbc65.lbc65--lbc65.renamed_corrected.fasta' returned non-zero exit status 1

Second, when running sqanti_filter2.py, I got another error message:

Running SQANTI filtering...

Traceback (most recent call last):
  File "/data/daij/SQANTI2/sqanti_filter2.py", line 119, in <module>
    main()
  File "/data/daij/SQANTI2/sqanti_filter2.py", line 114, in main
    sqanti_filter_lite(args)
  File "/data/daij/SQANTI2/sqanti_filter2.py", line 55, in sqanti_filter_lite
    cat = CATEGORY_DICT[r['structural_category']]
KeyError: 'fusion'

Do you have any idea about that?

Thanks a lot,

Jieqiong

input isoforms.fasta for chain_samples.py

Hi Liz,

Sorry if I have misunderstood: which input isoforms.fasta is needed when using a multi-sample FL count file produced by chain_samples.py?
I generated a multi-sample FL count from 10 multiplexed tissues, and I have collapsed fasta files for each demultiplexed sample. Should I somehow generate a merged fasta for all 10 samples to use as the input isoforms for SQANTI2, together with the multi-sample FL count data?

Thank you.
Pablo

python2 and python3 hybrid?

mapping error in sqanti_qc2.py

Hey Liz,
I'm running your latest version of sqanti2 (v2.6, and cupcake at v6.8) and I'm getting an error when the script calls gmap to align the isoforms (in bold below). It looks like the gmap arguments are out of order: the gmap directory is being passed to the gmap command as "--cross-species" instead of as the path I provided in the -x parameter. The script I ran was:

python /SQANTI2/sqanti_qc2.py
--aligner_choice=gmap
--cage_peak /hg38.cage_peak_phase1and2combined_coord.bed
--polyA_motif_list /human_polyA_list.txt
-x /GRCh38.p12 -t 24 -z
-o BS6.isoseq3_polished.hq.uniq.sorted.collapsed.filtered.rep -d
-c /intropolis.v1.hg19_with_liftover_to_hg38.tsv
-fl /BS6.isoseq3_polished.hq.uniq.sorted.collapsed.filtered.abundance.txt
/BS6.isoseq3_polished.all.uniq.sorted.collapsed.filtered.rep.fa
/gencode.v30.chr_patch_hapl_scaff.annotation.gtf
/GRCh38.p12.genome.fa

Thanks in advance for checking this out!
Best,
Nancy

####################################################################
R scripting front-end version 3.3.1 (2016-06-21)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /BS6.isoseq3_polished.hq.uniq.sorted.collapsed.filtered.rep.renamed.fasta

GMAP version 2019-02-15 called with args: gmap.sse42 -D --cross-species -n 1 --max-intronlength-middle=2000000 --max-intronlength-ends=2000000 -L 3000000 -f samse -t 24 <gmap_path> -d GRCh38.p12 -z sense_force /BS6.isoseq3_polished.hq.uniq.sorted.collapsed.filtered.rep.renamed.fasta
Note: -n 1 will not report chimeric alignments. If you want a single alignment plus chimeras, use -n 0 instead.

Checking compiler assumptions for SSE2: 6B8B4567 327B23C6 xor=59F066A1
Checking compiler assumptions for SSE4.1: -103 -58 max=198 => compiler zero extends
Checking compiler options for SSE4.2: 6B8B4567 __builtin_clz=1 __builtin_ctz=0 _mm_popcnt_u32=17 __builtin_popcount=17
Finished checking compiler assumptions
Unable to find genome directory --cross-species
Either recompile the GMAP package to have the correct default directory (seen by doing gmap --version),
or use the -D flag to gmap to specify the correct genome directory.
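
The GMAP log shows `-D --cross-species`, which is what a command line looks like when the value that should follow `-D` is empty, so the next flag is consumed as the directory. The sketch below demonstrates that effect with a made-up command template; it is not SQANTI2's actual command string, and `/path/to/gmap_db` is a placeholder.

```python
import shlex

# Hypothetical template: the directory is interpolated right after -D.
TEMPLATE = "gmap -D {gmap_dir} --cross-species -n 1"


def build_argv(gmap_dir):
    """Split the formatted command the way a shell would tokenize it."""
    return shlex.split(TEMPLATE.format(gmap_dir=gmap_dir))


def option_value(argv, flag):
    """Return the token that immediately follows `flag`."""
    return argv[argv.index(flag) + 1]


good = build_argv("/path/to/gmap_db")
bad = build_argv("")  # empty directory value

print(option_value(good, "-D"))
print(option_value(bad, "-D"))
```

Validating that such a value is non-empty before formatting the command, or building the argument list explicitly, avoids this kind of silent shift.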

python: can't open file 'sqanti_qc2.py': [Errno 2] No such file or directory

Hello, I have an installation problem. Any help will be appreciated.
I installed it following the instructions, but when I type python sqanti_qc2.py, the terminal displays the error python: can't open file 'sqanti_qc2.py': [Errno 2] No such file or directory.
After I type python setup.py install, the terminal displays:

Installing out_to_chain.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_count_overlapping.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing wiggle_to_array_tree.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing tfloc_summary.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_interval_alignibility.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_thread_for_species.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_print_scores.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing prefix_lines.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_col_counts.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing interval_join.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_extract_chrom_ranges.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_bigwig_profile.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_subtract_basewise.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_limit_to_species.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_diff_basewise_summary.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_coverage.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_to_axt.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing line_select.py script to /home/dsy/anaconda3/envs/sqanti2/bin

Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages/bx_python-0.8.6-py3.7-linux-x86_64.egg
Searching for biopython==1.76
Best match: biopython 1.76
Adding biopython 1.76 to easy-install.pth file

Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages
Searching for scikit-learn==0.22.1
Best match: scikit-learn 0.22.1
Processing scikit_learn-0.22.1-py3.7-linux-x86_64.egg
scikit-learn 0.22.1 is already the active version in easy-install.pth

Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages/scikit_learn-0.22.1-py3.7-linux-x86_64.egg
Searching for six==1.13.0
Best match: six 1.13.0
Adding six 1.13.0 to easy-install.pth file

Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages
Searching for scipy==1.4.1
Best match: scipy 1.4.1
Processing scipy-1.4.1-py3.7-linux-x86_64.egg
scipy 1.4.1 is already the active version in easy-install.pth

Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages/scipy-1.4.1-py3.7-linux-x86_64.egg
Searching for joblib==0.14.1
Best match: joblib 0.14.1
Processing joblib-0.14.1-py3.7.egg
joblib 0.14.1 is already the active version in easy-install.pth

Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages/joblib-0.14.1-py3.7.egg
Finished processing dependencies for cupcake==9.1.1

Does this output indicate that the installation was successful?
Thank you for your help!

Support SAM/BAM input for SQANTI2

args.aligner_choice error when running with -g option

When running SQANTI2 with the -g option (i.e. providing a GTF rather than an input FASTA file), the program fails with a namespace error.

Questions about the results

Hi @Magdoll,

I ran SQANTI2 and I have some questions about the results.
First, I noticed that a total of 5475 sequences were added to the .renamed_corrected.fasta file. In the classification output file, they have the suffix _dup2 or _dup3. I tried aligning some of these sequences (for example transcript/22 and transcript/22_dup2), but there seems to be no significant similarity between them. Could you clarify where these sequences came from?

Also, on your page, you state that field 27 (FSM_class) of the classification output file should be ignored. Does this mean that the following fields (28: ORF_length, 29: CDS_length, 30: CDS_start and 31: CDS_end) should also be ignored?

Melissa

Error in sqanti_qc2.py

Hello, I am not entirely sure why I am getting the error. Any help will be appreciated.

Sqanti.pbs

#!/bin/bash
#PBS -P BLRseq
#PBS -N sqanti
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=20:00:00
#PBS -e /scratch/BLRseq/BLR/results/annotation/structural/BLR560/isoseq/sqanti/sqanti.err
#PBS -o /scratch/BLRseq/BLR/results/annotation/structural/BLR560/isoseq/sqanti/sqanti.out

wdir=/scratch/BLRseq/BLR
outdir=$wdir/results/annotation/structural/BLR560/isoseq/sqanti
ref=$wdir/results/annotation/structural/BLR560/ref.fa
cds=$wdir/data/BLR560/isoseq/SMRT/cluster/hq_isoforms.fasta
gtf=$outdir/augustus.ab_initio.gtf

sqanti=/scratch/BLRseq/bin/SQANTI2
cupcake=/scratch/BLRseq/bin/cDNA_Cupcake
module load python
module load cufflinks/2.2.1
module load ucsc-userapps/348
module load R/3.3.2
module load minimap2/2.3
module load samtools/1.9
export PYTHONPATH=$PYTHONPATH:$cupcake/sequence

minimap2 -ax splice -uf $ref $cds > $outdir/aln.sam
samtools sort -O sam -o $outdir/aln.sorted.sam -@ 8 $outdir/aln.sam
python $cupcake/cupcake/tofu/collapse_isoforms_by_sam.py --input $cds -s $outdir/aln.sorted.sam -o $outdir/hq --dun-merge-5-shorter
python $sqanti/sqanti_qc2.py --aligner_choice minimap2 -t 8 --output BLR560 --dir $outdir $outdir/hq.collapsed.rep.fa $gtf $ref

$ tail -20 sqanti.err

[M::worker_pipeline::8.938*3.74] mapped 8843 sequences
[M::main] Version: 2.3-r545-dirty
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 8 /scratch/BLRseq/BLR/results/annotation/structural/BLR560/ref.fa /scratch/BLRseq/BLR/results/annotation/structural/BLR560/isoseq/sqanti/hq.collapsed.rep.renamed.fasta
[M::main] Real time: 8.965 sec; CPU: 33.463 sec
output written to /scratch/BLRseq/BLR/results/annotation/structural/BLR560/isoseq/sqanti/hq.collapsed.rep.renamed_corrected.fasta
**** Parsing Isoforms....
Traceback (most recent call last):
  File "/scratch/BLRseq/bin/SQANTI2/sqanti_qc2.py", line 1762, in <module>
    main()
  File "/scratch/BLRseq/bin/SQANTI2/sqanti_qc2.py", line 1758, in main
    run(args)
  File "/scratch/BLRseq/bin/SQANTI2/sqanti_qc2.py", line 1405, in run
    write_collapsed_GFF_with_CDS(isoforms_info, corrGTF, corrGTF+'.cds.gff')
  File "/scratch/BLRseq/bin/SQANTI2/sqanti_qc2.py", line 399, in write_collapsed_GFF_with_CDS
    for r in reader:
  File "/home/psur9757/.local/lib/python2.7/site-packages/cupcake/io/GFF.py", line 393, in next
    return self.read()            
  File "/home/psur9757/.local/lib/python2.7/site-packages/cupcake/io/GFF.py", line 550, in read
    assert raw[2] == 'transcript'
AssertionError

Dependency for STAR

Hi Liz,

I am unsure which Python STAR module line 44 of sqanti_qc2.py refers to. Thanks!

A question about "associated_gene" from _classification.txt

Hi Elizabeth,

I am sorry for asking a question rather than reporting an error here (I can't find another way to contact you).

When I look at the "associated_gene" column of a result file (*classification.txt), I sometimes see two gene names connected by an underscore ("_"), a combination that isn't in the reference. At first I thought it could mean a fused gene, but structural_category says "genic", not "fusion", so I am not sure how to interpret this. Could you please explain what this type of "associated_gene" means? Thanks a lot!

More specific instructions installing cDNA_cupcake

FYI, it appears that cDNA_Cupcake recently created a Python 2.7 branch. The most recent master only supports Python 3.7. In your installation instructions, perhaps suggest switching to the 2.7 branch after cloning the repo?

E.g.

cd cDNA_Cupcake
git checkout origin/Py2_v8.7.x
python setup.py build
 pip install --prefix=/some/path/cDNA_Cupcake/ --ignore-installed .

Best,
Ali

Will SQANTI2 work on a custom fasta file?

My fasta file is not output from PacBio (it is from Nanopore instead), and I got this error:

Invalid input IDs! Expected PB.X.Y or PB.X.Y|xxxxx or PBfusion.X format but saw ENSG00000197956.9_153534599_153535991_1 instead. Abort!

Is it possible to use a custom fasta file as input for SQANTI2?
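
The error message names three accepted ID shapes: `PB.X.Y`, `PB.X.Y|xxxxx`, and `PBfusion.X`. Sequence IDs can be checked against those shapes before running the pipeline, as in the sketch below. The regular expression is an assumption derived only from that message (with X and Y taken to be integers), not SQANTI2's own validation code.

```python
import re

# Assumed from the error message: PB.X.Y, PB.X.Y|anything, or PBfusion.X
ID_PATTERN = re.compile(r"^(PB\.\d+\.\d+(\|\S+)?|PBfusion\.\d+)$")


def is_expected_id(seq_id):
    """Return True if `seq_id` matches one of the three ID shapes."""
    return ID_PATTERN.match(seq_id) is not None


for seq_id in ("PB.1.2", "PB.1.2|chr1:100-200", "PBfusion.3",
               "ENSG00000197956.9_153534599_153535991_1"):
    print(seq_id, is_expected_id(seq_id))
```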

AttributeError when running SQANTI2

Ok, I'm making progress in getting the latest version of SQANTI2 to run, but I've now hit the error below. I wonder whether a change in Biopython causes the issue, or whether there is some other version incompatibility. Have you encountered this issue before?

Traceback (most recent call last):
  File "/sc/orga/projects/vanbah01a/opt/SQANTI2//sqanti_qc2.py", line 1539, in <module>
    main()
  File "/sc/orga/projects/vanbah01a/opt/SQANTI2//sqanti_qc2.py", line 1535, in main
    run(args)
  File "/sc/orga/projects/vanbah01a/opt/SQANTI2//sqanti_qc2.py", line 1181, in run
    orfDict = correctionPlusORFpred(args, genome_dict)
  File "/sc/orga/projects/vanbah01a/opt/SQANTI2//sqanti_qc2.py", line 363, in correctionPlusORFpred
    err_correct(args.genome, corrSAM, corrFASTA, genome_dict=genome_dict)
  File "/sc/orga/projects/vanbah01a/opt/cDNA_Cupcake/sequence/err_correct_w_genome.py", line 31, in err_correct
    seq = sp.consistute_genome_seq_from_exons(genome_dict, r.sID, r.segments, r.flag.strand)
  File "/sc/orga/projects/vanbah01a/opt/cDNA_Cupcake/sequence/coordinate_mapper.py", line 190, in consistute_genome_seq_from_exons
    return seq.tostring()
AttributeError: 'Seq' object has no attribute 'tostring'
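A likely cause, offered as a sketch rather than a confirmed diagnosis: newer Biopython releases no longer provide Seq.tostring(), and str(seq) is the replacement. The helper below (seq_to_str is a hypothetical name, not part of cDNA_Cupcake) works with either kind of object; FakeSeq is only a stand-in so the example runs without Biopython installed.

```python
def seq_to_str(seq):
    """Return the plain string for a sequence-like object.

    Uses tostring() when the object still has it (older Biopython),
    otherwise falls back to str(), which Seq objects support.
    """
    to_string = getattr(seq, "tostring", None)
    if callable(to_string):
        return to_string()
    return str(seq)

class FakeSeq(object):
    """Stand-in for a Seq object without tostring(); for illustration only."""
    def __init__(self, data):
        self.data = data
    def __str__(self):
        return self.data

print(seq_to_str(FakeSeq("ACGT")))
```

Under that assumption, the line `return seq.tostring()` shown in the traceback would become `return str(seq)`, or the installed Biopython would need to be an older release that matches the cDNA_Cupcake version in use.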

AttributeError running SQANTI2

I just installed the latest version of SQANTI2 and got this error:
Traceback (most recent call last):
  File "SQANTI2-master/sqanti_qc2.py", line 66, in <module>
    v1, v2 = map(int, cupcake.version.split('.'))
AttributeError: 'module' object has no attribute 'version'

--is_fusion (subprocess.CalledProcessError:)

Hello Magdoll,

I ran sqanti_qc2.py according to your suggestion, as below:

sqanti_qc2.py -t 30 ../hq_isoforms.fasta.fusion.rep.fq ../gencode.v30.annotation.gtf ../GRCh38.p12.genome.fa --cage_peak ../hg38.cage_peak_phase1and2combined_coord.bed --coverage ../intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified --is_fusion

I replaced the collapsed FASTQ with the fusion FASTQ, which contains the PBfusion.X IDs, but I got the error below. Could you please let me know what went wrong? Thank you for your help!

R scripting front-end version 3.5.1 (2018-07-02)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/fusion/hq_isoforms.fasta.fusion.rep.renamed.fasta
[M::mm_idx_gen::91.598*1.86] collected minimizers
[M::mm_idx_gen::123.672*2.81] sorted minimizers
[M::main::123.673*2.81] loaded/built the index for 593 target sequence(s)
[M::mm_mapopt_update::128.297*2.74] mid_occ = 803
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 593
[M::mm_idx_stat::130.498*2.71] distinct minimizers: 167309999 (34.26% are singletons); average occurrences: 6.324; average spacing: 3.074
[M::worker_pipeline::131.405*2.74] mapped 43 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -ub -t 30 /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/GRCh38.p12.genome.fa /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/fusion/hq_isoforms.fasta.fusion.rep.renamed.fasta
[M::main] Real time: 131.678 sec; CPU: 360.729 sec; Peak RSS: 18.695 GB
output written to /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/squanti2-3/hq_isoforms.fasta.fusion.rep.renamed_corrected.fasta
error: , /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/squanti2-3/hq_isoforms.fasta.fusion.rep.renamed_corrected.fasta
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/GRCh38.p12.genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
Traceback (most recent call last):
  File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1599, in <module>
    main()
  File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1595, in main
    run(args)
  File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1248, in run
    orfDict = correctionPlusORFpred(args, genome_dict)
  File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 480, in correctionPlusORFpred
    if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
  File "/opt/SQANTI2/20190618/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /opt/SQANTI2/20190618/bin/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/squanti2-3/GMST/GMST_tmp /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/squanti2-3/hq_isoforms.fasta.fusion.rep.renamed_corrected.fasta' returned non-zero exit status 1

Result analysis

Dear @Magdoll,

Thank you for providing a wonderful tool for long read data analysis.
I am trying to run SQANTI2 on cDNA nanopore data and wanted to ask you about the results.

When I look at the characterization of transcripts, which shows how many isoforms fall into each category (FSM, ISM, NIC, etc.), I find that about half of the unique isoforms fall into the antisense category. These isoforms are also labeled as novel isoforms. I still need to check whether these reads are opposite-stranded, but I am wondering whether SQANTI2 can annotate these antisense reads as opposite-stranded reads.

Jungwoo

Some questions about the results

Hi Magdoll,
I follow the Iso-Seq3 -> Cupcake ToFU -> SQANTI pipeline, and I have some questions about the SQANTI results.
First, I noticed that the attributes in XX.collapsed.rep.renamed_corrected.gtf look like

transcript_id "PB.1.1"; gene_id "PB.1.1";

instead of

transcript_id "PB.1.1"; gene_id "PB.1";

Second, I found no GTF output after running sqanti_filter2.py, although I can grep the filtered genes in corrected.gtf to make a filter.gtf.

Thanks for your pipeline. It is very helpful!
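For the first question, a small post-processing sketch, assuming the only change wanted is to set gene_id to the PB.X prefix of the PB.X.Y transcript_id. The function name fix_gene_id is hypothetical, and the attribute order is taken from the two lines quoted above.

```python
import re

# Matches the attribute layout quoted above: transcript_id first, then gene_id.
ATTR = re.compile(r'transcript_id "(PB\.\d+)\.(\d+)"; gene_id "[^"]*";')

def fix_gene_id(attributes):
    """Rewrite gene_id to the PB.X prefix of the PB.X.Y transcript_id."""
    return ATTR.sub(r'transcript_id "\1.\2"; gene_id "\1";', attributes)

before = 'transcript_id "PB.1.1"; gene_id "PB.1.1";'
after = fix_gene_id(before)
print(after)
```

Attribute strings that do not match the pattern are returned unchanged, so the function can be applied to the ninth column of every GTF line.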

No gene names in the output result

Hi @Magdoll
I tried to analyze goat transcripts using SQANTI, but there are no gene names corresponding to the transcripts in the output report. I also find that some novel isoforms have their own names, such as 'novelGene_AS' in the 'antisense' or 'genic' category, while others don't. I have checked the reference GTF and reference genome FASTA, and I don't think they have problems. I compared your example Input.fa with mine and found that my Input.fa has no corresponding chromosomes; the sequences aren't mapped. So I want to know whether my Input.fa is wrong and what I should do next.
Question.pptx

Question about generating a GFF3 file

Hello @Magdoll,

I used SQANTI2 results to update my old gene annotation. Is there a quick way to generate GFF3-format annotations from the SQANTI2 results?

Best,
Xu

subprocess.CalledProcessError

Hello,

I got an error (subprocess.CalledProcessError) when I ran SQANTI2 (v3.5.1). I have attached the log below. Could you please advise me on how to resolve it? Thank you!

Taehee

sqanti_qc2.py -t 30 collapse.collapsed.rep.fq gencode.v30.annotation.gtf GRCh38.p12.genome.fa --cage_peak hg38.cage_peak_phase1and2combined_coord.bed --coverage intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified --fl_count collapse.collapsed.abundance.txt

R scripting front-end version 3.5.1 (2018-07-02)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/collapse/collapse.collapsed.rep.renamed.fasta
[M::mm_idx_gen::83.946*1.76] collected minimizers
[M::mm_idx_gen::94.838*3.25] sorted minimizers
[M::main::94.839*3.25] loaded/built the index for 593 target sequence(s)
[M::mm_mapopt_update::100.084*3.13] mid_occ = 803
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 593
[M::mm_idx_stat::102.354*3.08] distinct minimizers: 167309999 (34.26% are singletons); average occurrences: 6.324; average spacing: 3.074
[M::worker_pipeline::115.089*5.66] mapped 23933 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 30 /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/GRCh38.p12.genome.fa /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/collapse/collapse.collapsed.rep.renamed.fasta
[M::main] Real time: 115.208 sec; CPU: 651.074 sec; Peak RSS: 23.933 GB
output written to /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/test/collapse.collapsed.rep.renamed_corrected.fasta
**** Parsing Isoforms....
Input pattern: /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/intropolis/intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified. The following files found and to be read as junctions:
/net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/intropolis/intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified
5041602 junctions read. 0 junctions added to both strands because no strand information from STAR.
**** RT-switching computation....
**** Reading Full-length read abundance files...
WARNING: PB.6189.1.gene not found in FL count file. Assign count as 0.
WARNING: PB.5162.3.gene not found in FL count file. Assign count as 0.
WARNING: PB.7490.1.gene not found in FL count file. Assign count as 0.
.
.
.
WARNING: PB.7367.10.gene not found in FL count file. Assign count as 0.
WARNING: PB.2750.7.gene not found in FL count file. Assign count as 0.
WARNING: PB.4315.1.gene not found in FL count file. Assign count as 0.
Isoforms expression files not provided.
**** Writing output files....
**** Generating SQANTI report....

Attaching package: ‘dplyr’

The following object is masked from ‘package:gridExtra’:

combine

The following object is masked from ‘package:reshape’:

rename

The following objects are masked from ‘package:stats’:

filter, lag

The following objects are masked from ‘package:base’:

intersect, setdiff, setequal, union

Warning messages:
1: Removed 1 rows containing missing values (geom_text).
2: Removed 1 rows containing missing values (geom_bar).
3: Removed 1 rows containing missing values (geom_text).
Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
Execution halted
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/GRCh38.p12.genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
**** Parsing Reference Transcriptome....
**** Reading Splice Junctions coverage files.
**** Reading CAGE Peak data.
**** Performing Classification of Isoforms....
Number of classified isoforms: 47866
Traceback (most recent call last):
  File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1717, in <module>
    main()
  File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1713, in main
    run(args)
  File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1531, in run
    if subprocess.check_call(cmd, shell=True)!=0:
  File "/opt/SQANTI2/20190618/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/opt/SQANTI2/20190618/bin/Rscript /opt/SQANTI2/20190618/bin/utilities//SQANTI_report2.R /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/test/collapse.collapsed.rep_classification.txt /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/test/collapse.collapsed.rep_junctions.txt' returned non-zero exit status 1

Error during Isoform Classification

Hi Liz,

I have been trying to run SQANTI2, but I have run into an error and am having a hard time identifying what exactly could be causing it. This is the log output I get from the sqanti_qc2.py run:

R scripting front-end version 3.4.3 (2017-11-30)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: cupcake_processing/isoforms.polished.hq.collapsed.rep.renamed.fasta
Write arguments to isoforms.collapsed.sqanti2qc.params.txt...
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta data/hg38/GRCh38.p13_refseq_genomic.fna....
Error corrected FASTA isoforms.polished.hq.collapsed.rep.renamed_corrected.fasta already exists. Using it...
**** Predicting ORF sequences...
ORF file isoforms.polished.hq.collapsed.rep.renamed_corrected.faa already exists. Using it....
**** Parsing Reference Transcriptome....
refAnnotation_isoforms.collapsed.sqanti2qc.genePred already exists. Using it.
**** Parsing Isoforms....
Splice Junction Coverage files not provided.
**** Performing Classification of Isoforms....
Traceback (most recent call last):
  File "SQANTI2/sqanti_qc2.py", line 1996, in <module>
    main()
  File "SQANTI2/sqanti_qc2.py", line 1991, in main
    run(args)
  File "SQANTI2/sqanti_qc2.py", line 1613, in run
    isoforms_info = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict)
  File "SQANTI2/sqanti_qc2.py", line 1459, in isoformClassification
    orfDict[rec.id].cds_genomic_end = m[orfDict[rec.id].cds_end-1] + 1 # make it 1-based
KeyError: 1321

As you can see, everything works fine until the Isoform Classification step. Do you have any idea whether it could be a bug, or something related to the particular formatting that the ORF prediction outputs using my data? I'm saying this because it seems to be some sort of error in the creation of the dictionary storing the ORF information...

Thanks!

Ángeles
PhD Student, Conesa Lab

AssertionError on v3.3

Hi Liz,

I am running sqanti_qc2.py version 3.3 with the following command:
python sqanti_qc2.py --aligner_choice gmap -x Gbarbadense -t 20 --geneid all.collapsed.rep.fa genome.gtf genome.fa -o test

In my STDERR, I get this error:

output written to all.collapsed.rep.renamed_corrected.fasta
**** Parsing Isoforms....
Traceback (most recent call last):
  File "/export/pipeline/RNASeq/Software/SQANTI2/V3.3/SQANTI2-master/sqanti_qc2.py", line 1746, in <module>
    main()
  File "/export/pipeline/RNASeq/Software/SQANTI2/V3.3/SQANTI2-master/sqanti_qc2.py", line 1742, in main
    run(args)
  File "/export/pipeline/RNASeq/Software/SQANTI2/V3.3/SQANTI2-master/sqanti_qc2.py", line 1389, in run
    write_collapsed_GFF_with_CDS(isoforms_info, corrGTF, corrGTF+'.cds.gff')
  File "/export/pipeline/RNASeq/Software/SQANTI2/V3.3/SQANTI2-master/sqanti_qc2.py", line 387, in write_collapsed_GFF_with_CDS
    for r in reader:
  File "/export/pipeline/RNASeq/Software/cDNA_Cupcake/v8.0/cDNA_Cupcake/cupcake/io/GFF.py", line 393, in next
    return self.read()
  File "/export/pipeline/RNASeq/Software/cDNA_Cupcake/v8.0/cDNA_Cupcake/cupcake/io/GFF.py", line 550, in read
    assert raw[2] == 'transcript'
AssertionError

And in one of the output files "all.collapsed.rep.renamed_corrected.gtf" , the third column only contains "exon", but no "transcript".

Would you please kindly check this error out? Thanks a lot.
Feng

Error when running SQANTI2

Hi there!

I just got an error from running: python sqanti_qc2.py -t 30

Here is what the script returns:
Traceback (most recent call last):
  File "sqanti_qc2.py", line 117, in <module>
    if os.system(RSCRIPTPATH + " --version")!=0:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

I have already checked that I meet all the prerequisites. Many thanks in advance.
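The TypeError says that RSCRIPTPATH was None when the script tried to append " --version" to it. One plausible reading, which the traceback alone does not confirm, is that the Rscript executable was not found on PATH. A minimal sketch of a lookup that fails with a readable message instead (find_rscript is a hypothetical helper, and shutil.which requires Python 3.3 or later):

```python
import shutil

def find_rscript(name="Rscript"):
    """Return the full path to the executable, or raise a readable error."""
    path = shutil.which(name)  # None when the executable is not on PATH
    if path is None:
        raise RuntimeError(
            "{0} not found on PATH; install R or extend PATH first".format(name)
        )
    return path

try:
    rscript = find_rscript()
    print("Using", rscript)
except RuntimeError as err:
    rscript = None
    print(err)
```

Running `Rscript --version` in the same shell and environment used to launch sqanti_qc2.py is a quick way to check the same thing by hand.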

KeyError in sqanti_qc2.py version 3.5

Hi, Liz:

Thanks for your reply on issue #23 , and it solved my problem. But I ran into another error when running version 3.5. My commands were :

python sqanti_qc2.py --aligner_choice minimap2 -t 20 --geneid all.collapsed.rep.fa genome.gtf genome.fa -o test

The error message was:
output written to all.collapsed.rep.renamed_corrected.fasta
Skipping PB.9452.1 because unmapped.
**** Parsing Isoforms....
Traceback (most recent call last):
  File "/export/pipeline/RNASeq/Software/SQANTI2/v3.5/SQANTI2-master/sqanti_qc2.py", line 1753, in <module>
    main()
  File "/export/pipeline/RNASeq/Software/SQANTI2/v3.5/SQANTI2-master/sqanti_qc2.py", line 1749, in main
    run(args)
  File "/export/pipeline/RNASeq/Software/SQANTI2/v3.5/SQANTI2-master/sqanti_qc2.py", line 1392, in run
    isoforms_info = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict)
  File "/export/pipeline/RNASeq/Software/SQANTI2/v3.5/SQANTI2-master/sqanti_qc2.py", line 1278, in isoformClassification
    orfDict[rec.id].cds_genomic_end = m[orfDict[rec.id].cds_end-1] + 1 # make it 1-based
KeyError: 946

But when I changed the input isoforms from fasta format to gtf format, it completed successfully. The commands were:

python sqanti_qc2.py -g --geneid all.collapsed.gff genome.gtf genome.fa -o test

I wonder whether the cause is the code or the input isoform FASTA file. Could you give any advice?

Thanks.

How to generate a short read expression file?

I have a question about using RSEM to quantify Iso-Seq collapsed long reads with RNA-seq short reads, and then using the result as the short-read expression file for SQANTI2. Sorry if this is not the place to discuss it.
The SQANTI paper states that it can take Iso-Seq + RNA-seq as input. My understanding is that the Iso-Seq collapsed transcripts serve as the reference and the short reads are mapped against them using RSEM (or another tool). RSEM requires reference sequences built with the rsem-prepare-reference command. I tried to run this command to build a reference from the collapsed long reads, but it didn't work. Of all the options in the RSEM reference preparation step, I can't tell which one should be used. If anyone can provide the commands or options, that would be great.

For the Iso-Seq pipeline, I have GMAP-aligned files and ToFU collapsed transcripts; for RNA-seq, I have paired-end reads.

Cheers!

ERROR: --is_fusion

Hello Magdoll,

I have updated to the latest version of SQANTI2, but I still can't use '--is_fusion'. I downloaded it from https://github.com/Magdoll/SQANTI2. Did I download the correct one? Thank you!

sqanti_qc2.py -t 30 ../collapse.collapsed.rep.fq ../gencode.v30.annotation.gtf ../GRCh38.p12.genome.fa --cage_peak ../hg38.cage_peak_phase1and2combined_coord.bed --coverage ../intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified --fl_count ../abundance/collapse.collapsed.abundance.txt --is_fusion ../hq_isoforms.fasta.fusion.rep.fq

R scripting front-end version 3.5.1 (2018-07-02)
usage: sqanti_qc2.py [-h] [--aligner_choice {minimap2,deSALT,gmap}]
[--cage_peak CAGE_PEAK]
[--polyA_motif_list POLYA_MOTIF_LIST]
[--phyloP_bed PHYLOP_BED] [--skipORF] [--is_fusion] [-g]
[-e EXPRESSION] [-x GMAP_INDEX] [-t GMAP_THREADS] [-z]
[-o OUTPUT] [-d DIR] [-c COVERAGE] [-s SITES] [-w WINDOW]
[--geneid] [-fl FL_COUNT] [-v]
isoforms annotation genome
sqanti_qc2.py: error: unrecognized arguments: /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/fusion/hq_isoforms.fasta.fusion.rep.fq
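The usage line above lists [--is_fusion] without a value, which suggests it is an on/off switch, so a path placed after it is left over as an unrecognized argument. The argparse sketch below reproduces that behavior with a hypothetical stand-in parser (demo_qc and the file names are made up; this is not the sqanti_qc2.py source):

```python
import argparse

# Stand-in parser that mirrors only the part of the usage line shown above.
parser = argparse.ArgumentParser(prog="demo_qc")
parser.add_argument("isoforms")
parser.add_argument("annotation")
parser.add_argument("genome")
parser.add_argument("--is_fusion", action="store_true",
                    help="switch only; takes no value")

# The fusion FASTQ goes in the positional 'isoforms' slot; the flag stands alone.
ok = parser.parse_args(["fusion.rep.fq", "ref.gtf", "genome.fa", "--is_fusion"])
print(ok.isoforms, ok.is_fusion)
```

With this layout, passing a fourth file after --is_fusion makes argparse exit with "unrecognized arguments", as in the log above.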

Error in sqanti2

Hi @Magdoll,

I was running sqanti2 for the first time and got an error message saying:

**** Parsing Reference Transcriptome....
.../SQANTI2/utilities/gtfToGenePred: error while loading shared libraries: libpng12.so.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "SQANTI2/sqanti_qc2.py", line 1395, in <module>
    main()
  File "SQANTI2/sqanti_qc2.py", line 1391, in main
    run(args)
  File "SQANTI2/sqanti_qc2.py", line 1084, in run
    refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene = reference_parser(args, genome_dict.keys())
  File "SQANTI2/sqanti_qc2.py", line 480, in reference_parser
    for r in genePredReader(referenceFiles):
  File "SQANTI2/sqanti_qc2.py", line 90, in __init__
    self.f = open(filename)
IOError: [Errno 2] No such file or directory: '.../refAnnotation_hq_transcripts.genePred'
(sqanti) lena@pgm-Precision-WorkStation-T7500: ... /SQANTI2/utilities/gtfToGenePred: error while loading shared libraries: libpng12.so.0: cannot open shared object file: No such file or directory
-bash: /SQANTI2/utilities/gtfToGenePred:: No such file or directory

Do I need to provide a reference transcriptome file here??
Thanks for your help!

Lena

Unable to import err_correct_w_genome or sam_to_gff3.py! Please make sure cDNA_Cupcake/sequence/ is in $PYTHONPATH.

Hi @Magdoll ,

I created the environment and installed all the packages as per the tutorial.
I ran the following command:
$ python sqanti_qc2.py --aligner_choice=minimap2 ~/pacbio/testdir/mapped.fa Homo_sapiens.GRCh38.97.gtf HangleeGCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai

and get this msg:

Unable to import err_correct_w_genome or sam_to_gff3.py! Please make sure cDNA_Cupcake/sequence/ is in $PYTHONPATH.

sys.path shows that cDNA_Cupcake/sequence/ is in pythonpath

my Cupcake version:

Last Updated: 06/25/2019
Current version: 7.9

Any suggestions?
Also, the instructions first note that for Iso-Seq output one can use either FASTA or FASTQ, but the command lines then mention FASTA format only. I took my Collapsed Filtered Isoforms FASTQ file (already mapped) and converted it to FASTA format. Is that OK?

Thanks

for duplicated read names -> having "dup" on gtf output, but not on SAM/FASTA output

Hi Liz,

Thank you for your help last time. May I ask for help once more?

I have tested SQANTI2, and it seems that when a read ("read1") is split and mapped to different sites, SQANTI2 renames the alignments as "read1" and "read1_dup2".

However, it also seems that this renaming changes read names only in the GTF and classification output files, not in the FASTA and SAM files (I can find "_dup2" and "_dup3" in ".renamed_corrected.gtf", but not in ".renamed_corrected.fasta" or ".renamed_corrected.sam"). As a result, the FASTA and SAM files contain duplicated read names. This causes trouble in many downstream analyses, as many programs identify reads/isoforms only by read name.

Is this what SQANTI2 intended? If not, is there an easy way to fix this issue? Any advice would be really helpful!

Thanks,
Yeji
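As a way to see how many records are affected, here is a small sketch that lists read names occurring more than once in a FASTA file. It only reports duplicates; it does not change what SQANTI2 writes. The function name duplicated_ids and the toy records are illustrative.

```python
from collections import Counter

def duplicated_ids(fasta_lines):
    """Return the sorted read names that appear in more than one FASTA header."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip empty headers
                ids.append(parts[0])
    counts = Counter(ids)
    return sorted(name for name, count in counts.items() if count > 1)

toy_fasta = [">read1", "ACGT", ">read1", "GGCC", ">read2", "TTAA"]
print(duplicated_ids(toy_fasta))
```

For a real file, the lines can come straight from an open file handle, for example `duplicated_ids(open(path))`.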

error in --is_fusion

When I use --is_fusion, I get the error below.
The FASTA file is from fusion_finder.py:
python sqanti_qc2.py saf.fusion.rep.fa saf.gff3 genome.fasta --aligner_choice=minimap2 --is_fusion
If I use --is_fusion, ORF prediction is not run, but does the next step perhaps still use the *.genePred file?

R scripting front-end version 3.5.1 (2018-07-02)
[M::mm_idx_gen::44.284*0.97] collected minimizers
[M::mm_idx_gen::75.330*0.98] sorted minimizers
[M::main::75.345*0.98] loaded/built the index for 12 target sequence(s)
[M::mm_mapopt_update::77.569*0.98] mid_occ = 905
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 12
[M::mm_idx_stat::79.243*0.98] distinct minimizers: 79003367 (55.73% are singletons); average occurrences: 4.578; average spacing: 2.923
[M::worker_pipeline::81.420*0.98] mapped 99 sequences
[M::main] Version: 2.17-r943-dirty
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 1 /public/home/genome/genome.fasta /public/home/saf.fusion.rep.renamed.fasta
[M::main] Real time: 81.481 sec; CPU: 79.892 sec; Peak RSS: 7.007 GB
/public/home/saf.gff3 doesn't appear to be a GTF file (GFF not supported by this program)
WARNING: Currently if --is_fusion is used, no ORFs will be predicted.
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /public/home/saf.fusion.rep.renamed.fasta
output written to 
WARNING: Skipping ORF prediction because user requested it. All isoforms will be non-coding!
WARNING: All input isoforms were predicted as non-coding
Traceback (most recent call last):
  File "/public/home/software/SQANTI2-master/sqanti_qc2.py", line 1794, in <module>
    main()
  File "/public/home/software/SQANTI2-master/sqanti_qc2.py", line 1790, in main
    run(args)
  File "/public/home/software/SQANTI2-master/sqanti_qc2.py", line 1418, in run
    refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene = reference_parser(args, list(genome_dict.keys()))
  File "/public/home/software/SQANTI2-master/sqanti_qc2.py", line 625, in reference_parser
    for r in genePredReader(referenceFiles):
  File "/public/home/zcyu/software/SQANTI2-master/sqanti_qc2.py", line 124, in __init__
    self.f = open(filename)
FileNotFoundError: [Errno 2] No such file or directory: '/public/home/code/refAnnotation_saf.fusion.rep.genePred'

Caution: long input filenames create unrelated error messages

Hi Elizabeth,

When using sqanti_qc2.py, if the filename of the input fasta file is sufficiently long (and possibly including the path too), GeneMark will throw an error.

Error on open output file /mnt/data/IsoSeq_20190729/GMST/GMST_tmp
GeneMarkS: error on last system call, error code 256
Abort program!!!

Examination of the GeneMark log file suggests that gmst.pl is happy, but probuild is not:

/mnt/data/IsoSeq_20190729/SQANTI2/utilities/gmst/probuild --par /mnt/data/IsoSeq_20190729/SQANTI2/utilities/gmst/par_1.default --clean_join sequence --seq /mnt/data/IsoSeq_20190729/m54178_190723_190240.ccs.demux.primer_5p--primer_3p_flnc_polished_CUPCAKE.collapsed.rep.renamed_corrected.fasta --log gms.log
Error on last system call, error code 256
Abort program!!!

Direct invocation of probuild yields the following uninformative error:

error in command line

This feature is clearly related to GeneMark and not SQANTI2. Nevertheless, keeping the length of input filenames to a minimum solves this issue.

Sincerely,
Mark
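One way to keep names short without renaming the original data is to run from a short symlink. The sketch below is an assumption-laden illustration: short_alias and the alias name in.fasta are hypothetical, the demo file name is made up, and since the post notes the path may count toward the limit too, the working directory should itself be short.

```python
import os
import tempfile

def short_alias(long_path, workdir, alias="in.fasta"):
    """Create workdir/alias as a symlink to long_path and return its path."""
    alias_path = os.path.join(workdir, alias)
    if os.path.lexists(alias_path):
        os.remove(alias_path)  # replace a stale alias from an earlier run
    os.symlink(os.path.abspath(long_path), alias_path)
    return alias_path

# Demo in a throwaway directory with a made-up long file name.
workdir = tempfile.mkdtemp()
long_name = os.path.join(
    workdir, "m00000_000000_000000.ccs.demux.flnc_polished.collapsed.rep.fasta"
)
with open(long_name, "w") as handle:
    handle.write(">PB.1.1\nACGT\n")

alias_path = short_alias(long_name, workdir)
print(alias_path)
```

The alias would then be passed as the isoforms input in place of the long file name.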

Intropolis for hg38 genome, modified into STAR junction format.

Hi @Magdoll,

Would it be possible to make a "SQANTI2 ready-to-use" file (in STAR junction format) for Intropolis on the hg38 and hg19 genomes available for download somewhere? Or at least to include a script that does the conversion to STAR junction format, please?

In fact, it seems that none of the files downloadable from the Intropolis GitHub are in the right format for SQANTI2, or maybe I am wrong?

Thank you very much in advance for your answer

Best regards

ERROR: unrecognized arguments: --is_fusion

Hello Magdoll,

I got an error related to the fusion transcript option (--is_fusion). I have attached my log below. Do I need to update the version? Please let me know how I can resolve this issue. Thank you!

R scripting front-end version 3.5.1 (2018-07-02)
usage: sqanti_qc2.py [-h] [--aligner_choice {minimap2,deSALT,gmap}]
[--cage_peak CAGE_PEAK]
[--polyA_motif_list POLYA_MOTIF_LIST]
[--phyloP_bed PHYLOP_BED] [--skipORF] [-g]
[-e EXPRESSION] [-x GMAP_INDEX] [-t GMAP_THREADS] [-z]
[-o OUTPUT] [-d DIR] [-c COVERAGE] [-s SITES] [-w WINDOW]
[--geneid] [-fl FL_COUNT] [-v]
isoforms annotation genome
sqanti_qc2.py: error: unrecognized arguments: --is_fusion /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/fusion/hq_isoforms.fasta.fusion.abundance.txt

Error in Running sqanti_qc2.py

Hi @Magdoll

I was trying to run sqanti_qc2.py and came across the following error

  File "sqanti_qc2.py", line 43, in <module>
    from sam_to_gff3 import convert_sam_to_gff3
  File "/u/home/a/Resource/cDNA_Cupcake/sequence/sam_to_gff3.py", line 85
    parser.add_argument("-s", "--source", required=True, help="source name (ex: hg38, mm10)")
    ^
IndentationError: unexpected indent

cDNA_Cupcake/sequence is in my PATH. How can I resolve it?

SQANTI qc is very slow

I'm running the entire IsoSeq pipeline on some PacBio CCS reads (mRNA, average CCS length ~2kb). For now I'm running on a test dataset of 1000 reads. My pipeline has all the recommended steps (lima, isoseq3 refine, minimap2, etc). Of these, the sqanti2_qc step is the slowest by far and seems to take 50x longer than all other steps put together. Particularly, I'm giving it 8 cores (-t 8) but it only ever seems to use 1 core. Is this normal or is there a parallelization option I'm missing?

You can see my pipeline at https://github.com/gmstanle/nf-core-scisoseq.

Collapsed header in fasta file

Hello Magdoll,

My collapsed header after Iso-Seq3 and minimap2 is slightly different to your example. Is it ok to run SQANTI? Thank you!

PB.1.1|chr1:184917-199875(-)|transcript/19866 transcript/19866 full_length_coverage=28;length=1754;num_subreads=60
PB.2.1|chr1:827670-843589(+)|transcript/11029 transcript/11029 full_length_coverage=2;length=2604;num_subreads=54
PB.3.1|chr1:944203-959277(-)|transcript/15184 transcript/15184 full_length_coverage=2;length=2218;num_subreads=22
.
.
.
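For what it's worth, an error message quoted in another issue on this page lists "PB.X.Y|xxxxx" among the accepted ID shapes, and the first token of these headers has that shape. The sketch below only checks and parses that token; the regular expression is inferred from the three example headers, not taken from the SQANTI2 source, and parse_collapsed_header is a hypothetical helper.

```python
import re

# Pattern inferred from the example headers above (an assumption).
HEADER = re.compile(
    r"^(?P<pbid>PB\.\d+\.\d+)\|(?P<chrom>[^:|]+):(?P<start>\d+)-(?P<end>\d+)"
    r"\((?P<strand>[+-])\)\|(?P<read>\S+)"
)

def parse_collapsed_header(header):
    """Return the parsed fields of a collapsed header, or None if it differs."""
    match = HEADER.match(header.lstrip(">"))
    if match is None:
        return None
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    return fields

example = ("PB.1.1|chr1:184917-199875(-)|transcript/19866 transcript/19866 "
           "full_length_coverage=28;length=1754;num_subreads=60")
parsed = parse_collapsed_header(example)
print(parsed)
```

Headers that return None are the ones worth inspecting by hand before running the full pipeline.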
