magdoll / sqanti2
SQANTI2 is now replaced by SQANTI3. Please go to: https://github.com/ConesaLab/SQANTI3
License: Other
Hi Elizabeth,
I got an error (subprocess.CalledProcessError) with SQANTI2 (v2.8) at the "Predicting ORF" step. Strangely, it happens with only one of my samples; the other samples produce the classification result.
Here is the log:
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /DATA/test_OC_1_2/collapse/OC_1_2.collapsed.rep.renamed.fasta
Error corrected FASTA /DATA/test_OC_1_2/annotation/OC_1_2.collapsed.rep.renamed_corrected.fasta already exists. Using it...
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_US.utf8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Use of uninitialized value in open at /DATA/software/SQANTI2/utilities/gmst/gmst.pl line 885, <$FA> line 94.
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /DATA/database/mm10/reference/mm10.sort.fa....
**** Predicting ORF sequences...
Traceback (most recent call last):
File "/DATA/software/SQANTI2/sqanti_qc2.py", line 1599, in <module>
main()
File "/DATA/software/SQANTI2/sqanti_qc2.py", line 1595, in main
run(args)
File "/DATA/software/SQANTI2/sqanti_qc2.py", line 1248, in run
orfDict = correctionPlusORFpred(args, genome_dict)
File "/DATA/software/SQANTI2/sqanti_qc2.py", line 480, in correctionPlusORFpred
if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
File "/DATA/anaconda2/envs/rna/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /DATA/software/SQANTI2/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /DATA/test_OC_1_2/annotation/GMST/GMST_tmp /DATA/test_OC_1_2/annotation/OC_1_2.collapsed.rep.renamed_corrected.fasta' returned non-zero exit status 2
Thanks,
Y.Zhang
Hello,
sqanti_qc2.py works, but sqanti_filter2.py fails as shown below. Could this be a problem with the Python script itself? I would appreciate any help fixing it. Thank you!
Taehee
$ sqanti_filter2.py
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 1: author: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 2: version: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 12:
Lightweight filtering of SQANTI by using .classification.txt output
Only keep Iso-Seq isoforms if:
The isoform is FSM, ISM, or NIC and (does not have intrapriming or has polyA_motif)
The isoform is NNC, does not have intrapriming/or polyA motif, not RT-switching, and all junctions are either all canonical or short-read-supported
The isoform is antisense, intergenic, genic, does not have intrapriming/or polyA motif, not RT-switching, and all junctions are either all canonical or short-read-supported
: No such file or directory
import: unable to open X server `' @ error/import.c/ImportImageCommand/369.
import: unable to open X server `' @ error/import.c/ImportImageCommand/369.
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 16: from: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 17: from: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 18: from: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 19: from: command not found
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 21: syntax error near unexpected token `('
/opt/SQANTI2/20190822/envs/anaCogent3/bin/sqanti_filter2.py: line 21: `utilitiesPath = os.path.dirname(os.path.realpath(__file__))+"/utilities/"'
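The `command not found` and `import: unable to open X server` messages above are what a shell prints when it executes Python source line by line. That usually means the file has no `#!` interpreter line, or was started without `python`. A minimal sketch for checking the first case; `has_python_shebang` is a hypothetical helper and not part of SQANTI2:

```python
def has_python_shebang(path):
    """Return True if the first line of `path` is a '#!' line that mentions python."""
    with open(path, "rb") as handle:
        first_line = handle.readline()
    return first_line.startswith(b"#!") and b"python" in first_line
```

If this returns False for the installed sqanti_filter2.py, calling it as `python sqanti_filter2.py` avoids relying on the interpreter line.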
Dear @Magdoll
Hello, I ran your updated version of SQANTI2 with the command below:
python /appl/sqanti2_2/SQANTI2/sqanti_qc2.py -t 30 -c illumina/PM-AU-0002-N-A1SJ.out.tab 20180817_colon_2N_Nanoflit_q7_pychopper_2.fasta /data/ONT_RNA/reference/Homo_sapiens.GRCh38.93.gtf /data/ONT_RNA/reference/hg38.fa
The script stopped with the following error:
Error in `$<-.data.frame`(`*tmp*`, SJ_type, value = "__SJ") :
replacement has 1 row, data has 0
Calls: $<- -> $<-.data.frame
Execution halted
Traceback (most recent call last):
File "/appl/sqanti2_2/SQANTI2/sqanti_qc2.py", line 1515, in <module>
main()
File "/appl/sqanti2_2/SQANTI2/sqanti_qc2.py", line 1511, in main
run(args)
File "/appl/sqanti2_2/SQANTI2/sqanti_qc2.py", line 1346, in run
if subprocess.check_call(cmd, shell=True)!=0:
File "/usr/local/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
The script ran fine without the -c option (Illumina SJ file); the error appears only when I include it.
Thank you very much for your help!
Jungwoo
I know the current version of SQANTI2 does not support BAM as input. It uses FASTQ files and runs the alignment itself. If I want to do some post-alignment filtering before SQANTI2, I need to 1) align the FLNC reads on my own, 2) do the post-alignment filtering, 3) convert the BAM to FASTQ, and 4) run SQANTI2 on the filtered FASTQ, which performs an additional round of alignment.
This matters more in my case because I use FLNC reads rather than clustered reads, and FLNC reads need more post-processing. For my research, the number of PacBio reads matters for assessing isoform expression. Moreover, the clustering procedure discards singleton transcripts, which are evidence of the expression of rare isoforms.
Would it be possible to use BAM as input for SQANTI2?
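Steps 1 to 3 of the workaround described above could be scripted roughly as follows. This is only a sketch: the file names, the MAPQ cutoff and the helper `build_filter_workflow` are placeholders, and the filter shown (a mapping-quality threshold) stands in for whatever post-alignment filtering is actually needed. It only builds the command strings and does not run them.

```python
def build_filter_workflow(genome_fa, flnc_fastq, min_mapq=20, threads=8):
    """Return shell commands for: align -> filter -> convert back to FASTQ."""
    aligned = "flnc.aligned.bam"        # placeholder output names
    filtered = "flnc.filtered.bam"
    out_fastq = "flnc.filtered.fastq"
    return [
        # 1) align FLNC reads with the splice preset seen in the SQANTI2 logs
        "minimap2 -ax splice -uf -t {t} {g} {r} | samtools sort -o {a} -".format(
            t=threads, g=genome_fa, r=flnc_fastq, a=aligned),
        # 2) post-alignment filtering, here just a mapping-quality cutoff
        "samtools view -b -q {q} -o {f} {a}".format(q=min_mapq, f=filtered, a=aligned),
        # 3) convert the filtered BAM back to FASTQ
        "samtools fastq {f} > {o}".format(f=filtered, o=out_fastq),
    ]


for command in build_filter_workflow("genome.fa", "flnc.fastq"):
    print(command)
```

Step 4, running SQANTI2 on the resulting file, is unchanged and still includes its own alignment round.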
Hi:
[luping@centos split-F-gene]$ python /disk/luping/tools/SQANTI2-master/sqanti_qc2.py -t 15 -g ISO-fusion.collapsed.gtf Fusarium_graminearum.RR1.41.chr.gtf ph1.fasta
R scripting front-end version 3.5.1 (2018-07-02)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /disk/luping/bam/iso-bam/pb-cluster-hqlq-total/two-condition/split-F-gene/ISO-fusion.collapsed.renamed.fasta
Traceback (most recent call last):
File "/disk/luping/tools/SQANTI2-master/sqanti_qc2.py", line 1395, in <module>
main()
File "/disk/luping/tools/SQANTI2-master/sqanti_qc2.py", line 1384, in main
if args.aligner_choice == "minimap2":
AttributeError: 'Namespace' object has no attribute 'aligner_choice'
[luping@centos split-F-gene]$ minimap2 --version
2.14-r894-dirty
What's wrong with this?
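An `AttributeError` on a `Namespace` object means the code read an attribute that the argument parser never set. Whether that is what happens inside sqanti_qc2.py here is not confirmed; the sketch below only reproduces the pattern with a made-up parser and shows a defensive read with `getattr`:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("isoforms")
# Hypothetical: imagine --aligner_choice is never registered on this code path.
# parser.add_argument("--aligner_choice", default="minimap2")

args = parser.parse_args(["ph1.fasta"])

# Reading args.aligner_choice directly would raise AttributeError, as in the traceback.
aligner = getattr(args, "aligner_choice", "minimap2")
print(aligner)  # minimap2
```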
Hi,
This pipeline looks great, and I would like to run it on my data, which comes from Nanopore direct RNA sequencing. Do you think this is possible? I tried running it with a FASTA file but got an error about the read ID.
Thank you for your help.
Hi Elizabeth,
I am running sqanti_qc2.py (version 4.0) on your example data, with this command:
python sqanti_qc2.py -t 6 touse.rep.fasta gencode.v31.annotation.gtf hg38.genome.fa --cage_peak hg38.cage_peak_phase1and2combined_coord.bed
In my STDERR, I get this error:
R scripting front-end version 3.4.4 (2018-03-15)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /home/hcd_lab/software/SQANTI2-master/example/touse.rep.renamed.fasta
[M::mm_idx_gen::42.906*1.75] collected minimizers
[M::mm_idx_gen::54.057*2.60] sorted minimizers
[M::main::54.057*2.60] loaded/built the index for 25 target sequence(s)
[M::mm_mapopt_update::56.513*2.53] mid_occ = 763
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 25
[M::mm_idx_stat::57.933*2.49] distinct minimizers: 167178949 (35.44% are singletons); average occurrences: 6.015; average spacing: 3.071
[M::worker_pipeline::65.164*2.88] mapped 4730 sequences
[M::main] Version: 2.17-r954-dirty
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 6 /media/huyueming/disk1/data/hg38/hg38.genome.fa /home/hcd_lab/software/SQANTI2-master/example/touse.rep.renamed.fasta
[M::main] Real time: 65.291 sec; CPU: 187.598 sec; Peak RSS: 18.696 GB
output written to /home/hcd_lab/software/SQANTI2-master/example/touse.rep.renamed_corrected.fasta
**** Parsing Isoforms....
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /media/huyueming/disk1/data/hg38/hg38.genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
**** Parsing Reference Transcriptome....
Splice Junction Coverage files not provided.
**** Reading CAGE Peak data.
**** Performing Classification of Isoforms....
Number of classified isoforms: 4730
Traceback (most recent call last):
File "/home/hcd_lab/software/SQANTI2-master/sqanti_qc2.py", line 1762, in <module>
main()
File "/home/hcd_lab/software/SQANTI2-master/sqanti_qc2.py", line 1758, in main
run(args)
File "/home/hcd_lab/software/SQANTI2-master/sqanti_qc2.py", line 1405, in run
write_collapsed_GFF_with_CDS(isoforms_info, corrGTF, corrGTF+'.cds.gff')
File "/home/hcd_lab/software/SQANTI2-master/sqanti_qc2.py", line 399, in write_collapsed_GFF_with_CDS
for r in reader:
File "/home/hcd_lab/software/anaconda2/lib/python2.7/site-packages/cupcake-8.5-py2.7-linux-x86_64.egg/cupcake/io/GFF.py", line 393, in next
return self.read()
File "/home/hcd_lab/software/anaconda2/lib/python2.7/site-packages/cupcake-8.5-py2.7-linux-x86_64.egg/cupcake/io/GFF.py", line 550, in read
assert raw[2] == 'transcript'
AssertionError
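The failing line in the traceback above is `assert raw[2] == 'transcript'`, so the GFF reader appears to have met a record whose third column is something other than `transcript` at a point where it expected one. A quick diagnostic is to tally the feature types in the file being read. This is a sketch only; `count_feature_types` is a hypothetical helper, not part of cupcake or SQANTI2:

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Tally column 3 (feature type) over non-comment, tab-separated GFF/GTF lines."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

Passing an open file handle for the corrected GTF that SQANTI2 wrote shows at a glance whether `transcript` records are present at all.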
Hi @Magdoll,
I have run my Iso-Seq data through SQANTI and I noticed that some of the transcripts that are classified as "intergenic" are multi-exon transcripts that overlap a mono-exon transcript in the reference annotation. Based on the code, I can see why this happens since multi-exon transcripts are only compared to multi-exon transcripts from the reference. I'm not sure that classifying these transcripts as "intergenic" is logical though. Perhaps they would better fit under NNC? Or an additional label?
Hello @Magdoll ,
I've tried the newest version of SQANTI2 (v5.1.0) and saw the following log:
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax splice -uf -k14 -C5 --secondary=no -t 50 hg38_chr.fa file1.fasta
[M::main] Real time: 1452.274 sec; CPU: 68896.896 sec; Peak RSS: 54.618 GB
Error: invalid feature coordinates (end<start!) at line:
chr5 hg38_chr exon 40824922 40824921 . - . ID=94a3dcb5-de7c-4d08-b571-e2d0c5533afa_dup2.exon2;Name=94a3dcb5-de7c-4d08-b571-e2d0c5533afa_dup2.exon2;Parent=94a3dcb5-de7c-4d08-b571-e2d0c5533afa_dup2
Error: invalid feature coordinates (end<start!) at line:
chr2 hg38_chr exon 88935909 88935908 . - . ID=65ccca1d-b8ee-40ef-9b7a-2a617adf3cfe.exon2;Name=65ccca1d-b8ee-40ef-9b7a-2a617adf3cfe.exon2;Parent=65ccca1d-b8ee-40ef-9b7a-2a617adf3cfe
I've modified the minimap2 options for Nanopore direct RNA, but didn't change anything else. Only 4 lines say "invalid feature coordinates", but it bothers me nonetheless. Could you help me understand what's going on? Thanks!
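In both records quoted in the log, the end coordinate is exactly one less than the start (40824921 vs 40824922, and 88935908 vs 88935909), which looks like a zero-length exon. A small sketch for listing every such record in a GFF file; `find_inverted_features` is a hypothetical helper and assumes tab-separated columns with start and end in columns 4 and 5:

```python
def find_inverted_features(gff_lines):
    """Yield (line_number, line) for records whose end coordinate is before the start."""
    for number, line in enumerate(gff_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        start, end = int(fields[3]), int(fields[4])
        if end < start:
            yield number, line.rstrip("\n")
```

The output shows which transcripts are affected, so they can be inspected or removed before the next step.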
Posting the original issue here first....
I've mapped some Iso-Seq data, collapsed it with TAMA, and am now trying to do the final processing in SQANTI2, but I can't get past this error:
(/opt/sqanti2/4.1/anaCogent3) [rwr002@node35 ISOSEQ_gmap_tama]$ python /opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py --aligner_choice gmap -x genomes/GRCh38 -t 18 -o FAM43 -d TESTOUT fam43_isoseq_out.fastq.split.tama.renamed.fasta genomes/gencode.v32.annotation.gtf genomes/hg38.fa
R scripting front-end version 3.2.3 (2015-12-10)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/fam43_isoseq_out.fastq.split.tama.renamed.renamed.fasta
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/genomes/hg38.fa....
Aligned SAM /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/TESTOUT/fam43_isoseq_out.fastq.split.tama.renamed.renamed_corrected.sam already exists. Using it...
output written to /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/TESTOUT/fam43_isoseq_out.fastq.split.tama.renamed.renamed_corrected.fasta
**** Predicting ORF sequences...
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_S_create
GeneMarkS: error on last system call, error code 134
Abort program!!!
Traceback (most recent call last):
File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1793, in <module>
main()
File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1789, in main
run(args)
File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1414, in run
orfDict = correctionPlusORFpred(args, genome_dict)
File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 555, in correctionPlusORFpred
if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
File "/opt/sqanti2/4.1/anaCogent3/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /gpfs0/export/opt/sqanti2/4.1/SQANTI2-4.1/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/TESTOUT/GMST/GMST_tmp /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_gmap_tama/TESTOUT/fam43_isoseq_out.fastq.split.tama.renamed.renamed_corrected.fasta' returned non-zero exit status 1
I ran SQANTI2 on PacBio Sequel high-quality collapsed isoforms from Iso-Seq + ToFU. More than 50% of the isoforms are classified as "antisense" or "NIC". I used GMAP to map against the Ensembl mouse genome.
I also tried proovread to correct the high-quality isoforms with short reads. In the SQANTI2 report, the number of "NIC" isoforms dropped and "ISM" increased, but "antisense" is still around 25%.
Is it normal to see half of the isoforms classified as "antisense" and "NIC"? Or is it due to insufficient coverage?
Thanks!
I'm in the middle of prepping a script to perform this analysis, and I'd forgotten to enable the gmap module on my system, so SQANTI2 started but did not run properly. It then continued to fail until I deleted the files created during the aborted attempt; apparently it doesn't like having pre-existing files:
**** Predicting ORF sequences...
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_S_create
GeneMarkS: error on last system call, error code 134
Abort program!!!
Traceback (most recent call last):
File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1793, in <module>
main()
File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1789, in main
run(args)
File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 1414, in run
orfDict = correctionPlusORFpred(args, genome_dict)
File "/opt/sqanti2/4.1/SQANTI2-4.1/sqanti_qc2.py", line 555, in correctionPlusORFpred
if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
File "/opt/sqanti2/4.1/anaCogent3/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /gpfs0/export/opt/sqanti2/4.1/SQANTI2-4.1/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_2/GMST/GMST_tmp /gpfs0/home/gdlauberlab/rwr002/TESTING/ISOSEQ_2/cupcake.gmap.collapsed.rep.renamed_corrected.fasta' returned non-zero exit status 1
Hi Magdoll,
I ran your updated version of SQANTI2 (v3.2) with the command below:
python ../../../SQANTI2/sqanti_qc2.py -t 50 test.fasta ../../../references/gencode.v31.annotation.gtf.gz ../../../references/hg38_noALT.fa
and encountered the following error:
**** Performing Classification of Isoforms....
Traceback (most recent call last):
File "../../../SQANTI2/sqanti_qc2.py", line 1677, in <module>
main()
File "../../../SQANTI2/sqanti_qc2.py", line 1673, in main
run(args)
File "../../../SQANTI2/sqanti_qc2.py", line 1325, in run
isoforms_info = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict)
File "../../../SQANTI2/sqanti_qc2.py", line 1225, in isoformClassification
dist_to_last_junc = rec.junctions[0][1] - orfDict[rec.id].cds_genomic_end
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
Do you have any suggestions for fixing this error?
The same command worked fine on v2.8.
Hi @Magdoll,
I ran into some problems when running SQANTI2 and need your help to figure them out.
First, when running sqanti_qc2.py, I got the error message below:
R scripting front-end version 3.5.2 (2018-12-20)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /data/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/fasta/ccs.NP0328-157_3078_RTEL1_lbc65.lbc65--lbc65.renamed.fasta
[M::mm_idx_gen::123.330*1.73] collected minimizers
[M::mm_idx_gen::149.095*2.55] sorted minimizers
[M::main::149.095*2.55] loaded/built the index for 195 target sequence(s)
[M::mm_mapopt_update::154.771*2.49] mid_occ = 751
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 195
[M::mm_idx_stat::157.585*2.47] distinct minimizers: 167240184 (35.46% are singletons); average occurrences: 6.007; average spacing: 3.086
[M::worker_pipeline::223.786*4.05] mapped 15901 sequences
[M::main] Version: 2.15-r905
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 8 /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa /data/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/fasta/ccs.NP0328-157_3078_RTEL1_lbc65.lbc65--lbc65.renamed.fasta
[M::main] Real time: 224.164 sec; CPU: 906.497 sec; Peak RSS: 18.434 GB
output written to /gpfs/gsfs8/users/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/ccs.NP0328-157_3078_RTEL1_lbc65.lbc65--lbc65.renamed_corrected.fasta
Skipping m54137_181210_213828/5308570/ccs because unmapped.
...
Skipping m54137_181210_213828/56033863/ccs because unmapped.
error in command line
GeneMarkS: error on last system call, error code 256
Abort program!!!
Aligner choice: Minimap2.
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /fdb/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
Traceback (most recent call last):
File "/data/daij/SQANTI2/sqanti_qc2.py", line 1395, in <module>
main()
File "/data/daij/SQANTI2/sqanti_qc2.py", line 1391, in main
run(args)
File "/data/daij/SQANTI2/sqanti_qc2.py", line 1081, in run
orfDict = correctionPlusORFpred(args, genome_dict)
File "/data/daij/SQANTI2/sqanti_qc2.py", line 412, in correctionPlusORFpred
if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
File "/data/daij/miniconda2/envs/anaCogent5.2/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /gpfs/gsfs8/users/daij/SQANTI2/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /gpfs/gsfs8/users/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/GMST/GMST_tmp /gpfs/gsfs8/users/daij/project/RTEL1_Sharon/CCS/hg38/splicing/sqanti/sqanti2/new/ccs.NP0328-157_3078_RTEL1_lbc65.lbc65--lbc65.renamed_corrected.fasta' returned non-zero exit status 1
Second, when running sqanti_filter2.py, I got another error message:
Running SQANTI filtering...
Traceback (most recent call last):
File "/data/daij/SQANTI2/sqanti_filter2.py", line 119, in <module>
main()
File "/data/daij/SQANTI2/sqanti_filter2.py", line 114, in main
sqanti_filter_lite(args)
File "/data/daij/SQANTI2/sqanti_filter2.py", line 55, in sqanti_filter_lite
cat = CATEGORY_DICT[r['structural_category']]
KeyError: 'fusion'
Do you have any idea about that?
Thanks a lot,
Jieqiong
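The `KeyError: 'fusion'` above means the classification file contains a `structural_category` value that the filter's lookup table does not list. A small sketch for finding such values before filtering. The table contents here are a made-up subset, `unknown_categories` is a hypothetical helper, and the file is assumed to be tab-delimited with a header row:

```python
import csv

# Hypothetical subset; the real CATEGORY_DICT in sqanti_filter2.py may differ.
KNOWN_CATEGORIES = {"full-splice_match", "incomplete-splice_match", "novel_in_catalog"}


def unknown_categories(handle, known=KNOWN_CATEGORIES):
    """Return the sorted structural_category values that are not in `known`."""
    reader = csv.DictReader(handle, delimiter="\t")
    seen = {row["structural_category"] for row in reader}
    return sorted(seen - set(known))
```

Called on an open classification file, this lists the categories the filter would stumble over.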
Hi Liz,
Sorry if I have misunderstood: which isoforms.fasta is needed as input when using a multi-sample FL count file produced by chain_samples.py?
I generated a multi-sample FL count file from 10 multiplexed tissues, and I have a collapsed FASTA file for each demultiplexed sample. Should I somehow generate a merged FASTA for all 10 samples to use as the input isoforms for SQANTI2, together with the multi-sample FL count data?
Thank you.
Pablo
Hey Liz,
I'm running your latest version of sqanti2 (v2.6, and cupcake at v6.8) and I'm getting an error when the script calls gmap to align the isoforms (in bold below). It looks like the gmap arguments are out of order: the gmap directory is being passed to the gmap command as "--cross-species" instead of as the path I provided in the -x parameter. The script I ran was:
python /SQANTI2/sqanti_qc2.py
--aligner_choice=gmap
--cage_peak /hg38.cage_peak_phase1and2combined_coord.bed
--polyA_motif_list /human_polyA_list.txt
-x /GRCh38.p12 -t 24 -z
-o BS6.isoseq3_polished.hq.uniq.sorted.collapsed.filtered.rep -d
-c /intropolis.v1.hg19_with_liftover_to_hg38.tsv
-fl /BS6.isoseq3_polished.hq.uniq.sorted.collapsed.filtered.abundance.txt
/BS6.isoseq3_polished.all.uniq.sorted.collapsed.filtered.rep.fa
/gencode.v30.chr_patch_hapl_scaff.annotation.gtf
/GRCh38.p12.genome.fa
Thanks in advance for checking this out!
Best,
Nancy
####################################################################
R scripting front-end version 3.3.1 (2016-06-21)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /BS6.isoseq3_polished.hq.uniq.sorted.collapsed.filtered.rep.renamed.fasta
GMAP version 2019-02-15 called with args: gmap.sse42 -D --cross-species -n 1 --max-intronlength-middle=2000000 --max-intronlength-ends=2000000 -L 3000000 -f samse -t 24 <gmap_path> -d GRCh38.p12 -z sense_force /BS6.isoseq3_polished.hq.uniq.sorted.collapsed.filtered.rep.renamed.fasta
Note: -n 1 will not report chimeric alignments. If you want a single alignment plus chimeras, use -n 0 instead.
Checking compiler assumptions for SSE2: 6B8B4567 327B23C6 xor=59F066A1
Checking compiler assumptions for SSE4.1: -103 -58 max=198 => compiler zero extends
Checking compiler options for SSE4.2: 6B8B4567 __builtin_clz=1 __builtin_ctz=0 _mm_popcnt_u32=17 __builtin_popcount=17
Finished checking compiler assumptions
Unable to find genome directory --cross-species
Either recompile the GMAP package to have the correct default directory (seen by doing gmap --version),
or use the -D flag to gmap to specify the correct genome directory.
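The logged call `gmap.sse42 -D --cross-species ...` is what a command line looks like when the value meant to follow `-D` is empty, so the next flag is taken as the directory. Whether sqanti_qc2.py builds its command this way is an assumption; the sketch below only illustrates the effect and a guard against it, with `build_gmap_prefix` as a hypothetical helper:

```python
def build_gmap_prefix(genome_dir):
    """Build the start of a gmap command line, refusing an empty -D value."""
    if not genome_dir:
        raise ValueError("genome directory for -D is empty")
    return ["gmap", "-D", genome_dir, "--cross-species", "-n", "1"]


# With plain string formatting, an empty directory silently shifts the arguments:
broken = "gmap -D {d} --cross-species".format(d="").split()
print(broken)  # ['gmap', '-D', '--cross-species']
```

Passing arguments as a list and validating them first makes this kind of shift fail early with a clear message.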
Hello, I have an installation problem. Any help will be appreciated.
I installed it following the instructions, but when I type python sqanti_qc2.py, the terminal displays this error: python: can't open file 'sqanti_qc2.py': [Errno 2] No such file or directory
After I type python setup.py install, the terminal displays:
Installing out_to_chain.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_count_overlapping.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing wiggle_to_array_tree.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing tfloc_summary.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_interval_alignibility.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_thread_for_species.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_print_scores.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing prefix_lines.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_col_counts.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing interval_join.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_extract_chrom_ranges.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_bigwig_profile.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_subtract_basewise.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_limit_to_species.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_diff_basewise_summary.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing bed_coverage.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing maf_to_axt.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Installing line_select.py script to /home/dsy/anaconda3/envs/sqanti2/bin
Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages/bx_python-0.8.6-py3.7-linux-x86_64.egg
Searching for biopython==1.76
Best match: biopython 1.76
Adding biopython 1.76 to easy-install.pth file
Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages
Searching for scikit-learn==0.22.1
Best match: scikit-learn 0.22.1
Processing scikit_learn-0.22.1-py3.7-linux-x86_64.egg
scikit-learn 0.22.1 is already the active version in easy-install.pth
Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages/scikit_learn-0.22.1-py3.7-linux-x86_64.egg
Searching for six==1.13.0
Best match: six 1.13.0
Adding six 1.13.0 to easy-install.pth file
Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages
Searching for scipy==1.4.1
Best match: scipy 1.4.1
Processing scipy-1.4.1-py3.7-linux-x86_64.egg
scipy 1.4.1 is already the active version in easy-install.pth
Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages/scipy-1.4.1-py3.7-linux-x86_64.egg
Searching for joblib==0.14.1
Best match: joblib 0.14.1
Processing joblib-0.14.1-py3.7.egg
joblib 0.14.1 is already the active version in easy-install.pth
Using /home/dsy/anaconda3/envs/sqanti2/lib/python3.7/site-packages/joblib-0.14.1-py3.7.egg
Finished processing dependencies for cupcake==9.1.1
Does this output indicate that the installation was successful?
Thank you for your help!
Support SAM/BAM input for SQANTI2
When running SQANTI2 with the -g option (i.e. providing a GTF rather than an input FASTA file), the program fails with a namespace error.
Hi @Magdoll,
I ran SQANTI2 and I have some questions about the results.
First, I noticed that a total of 5475 sequences got added to the .renamed_corrected.fasta file. In the classification output file, they have the suffix _dup2 or _dup3. I tried aligning some of these sequences (for example transcript/22 and transcript/22_dup2), but there seems to be no significant similarity between those sequences. Could you give me some clarification about where these sequences came from?
Also, on your page, you stated that field 27 (FSM_class) from the classification output file should be ignored. Does this mean that the following fields (28: ORF_length, 29: CDS_length, 30: CDS_start and 31: CDS_end) should also be ignored?
Melissa
Hello, I am not entirely sure why I am getting the error. Any help will be appreciated.
Sqanti.pbs
#!/bin/bash
#PBS -P BLRseq
#PBS -N sqanti
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=20:00:00
#PBS -e /scratch/BLRseq/BLR/results/annotation/structural/BLR560/isoseq/sqanti/sqanti.err
#PBS -o /scratch/BLRseq/BLR/results/annotation/structural/BLR560/isoseq/sqanti/sqanti.out
wdir=/scratch/BLRseq/BLR
outdir=$wdir/results/annotation/structural/BLR560/isoseq/sqanti
ref=$wdir/results/annotation/structural/BLR560/ref.fa
cds=$wdir/data/BLR560/isoseq/SMRT/cluster/hq_isoforms.fasta
gtf=$outdir/augustus.ab_initio.gtf
sqanti=/scratch/BLRseq/bin/SQANTI2
cupcake=/scratch/BLRseq/bin/cDNA_Cupcake
module load python
module load cufflinks/2.2.1
module load ucsc-userapps/348
module load R/3.3.2
module load minimap2/2.3
module load samtools/1.9
export PYTHONPATH=$PYTHONPATH:$cupcake/sequence
minimap2 -ax splice -uf $ref $cds > $outdir/aln.sam
samtools sort -O sam -o $outdir/aln.sorted.sam -@ 8 $outdir/aln.sam
python $cupcake/cupcake/tofu/collapse_isoforms_by_sam.py --input $cds -s $outdir/aln.sorted.sam -o $outdir/hq --dun-merge-5-shorter
python $sqanti/sqanti_qc2.py --aligner_choice minimap2 -t 8 --output BLR560 --dir $outdir $outdir/hq.collapsed.rep.fa $gtf $ref
$ tail -20 sqanti.err
[M::worker_pipeline::8.938*3.74] mapped 8843 sequences
[M::main] Version: 2.3-r545-dirty
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 8 /scratch/BLRseq/BLR/results/annotation/structural/BLR560/ref.fa /scratch/BLRseq/BLR/results/annotation/structural/BLR560/isoseq/sqanti/hq.collapsed.rep.renamed.fasta
[M::main] Real time: 8.965 sec; CPU: 33.463 sec
output written to /scratch/BLRseq/BLR/results/annotation/structural/BLR560/isoseq/sqanti/hq.collapsed.rep.renamed_corrected.fasta
**** Parsing Isoforms....
Traceback (most recent call last):
File "/scratch/BLRseq/bin/SQANTI2/sqanti_qc2.py", line 1762, in <module>
main()
File "/scratch/BLRseq/bin/SQANTI2/sqanti_qc2.py", line 1758, in main
run(args)
File "/scratch/BLRseq/bin/SQANTI2/sqanti_qc2.py", line 1405, in run
write_collapsed_GFF_with_CDS(isoforms_info, corrGTF, corrGTF+'.cds.gff')
File "/scratch/BLRseq/bin/SQANTI2/sqanti_qc2.py", line 399, in write_collapsed_GFF_with_CDS
for r in reader:
File "/home/psur9757/.local/lib/python2.7/site-packages/cupcake/io/GFF.py", line 393, in next
return self.read()
File "/home/psur9757/.local/lib/python2.7/site-packages/cupcake/io/GFF.py", line 550, in read
assert raw[2] == 'transcript'
AssertionError
Hi Liz,
I am unsure which Python STAR module you refer to on line 44 of sqanti_qc2.py. Thanks!
Hi Elizabeth,
I am sorry for asking a question rather than reporting an error here (I can't find another way to contact you).
When I look at the "associated_gene" column in a result file (*classification.txt), I sometimes see two gene names joined by an underscore ("_"), a combination that isn't in the reference. At first I thought it could mean a fused gene, but the structural_category says "genic", not "fusion", so I am not sure how to interpret it. Could you please explain what these "associated genes" mean? Thanks a lot!
FYI, it appears that cDNA_Cupcake recently created a Python 2.7 branch. The most recent master only supports Python 3.7. In your installation instructions, perhaps suggest switching to the 2.7 branch after cloning the repo?
E.g.
cd cDNA_Cupcake
git checkout origin/Py2_v8.7.x
python setup.py build
pip install --prefix=/some/path/cDNA_Cupcake/ --ignore-installed .
Best,
Ali
My FASTA file is not PacBio output (it comes from Nanopore instead), and I got the following error:
Invalid input IDs! Expected PB.X.Y or PB.X.Y|xxxxx or PBfusion.X format but saw ENSG00000197956.9_153534599_153535991_1 instead. Abort!
Is it possible to use a custom FASTA file as input for SQANTI2?
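The error message names the accepted ID shapes: PB.X.Y, PB.X.Y|xxxxx, or PBfusion.X. A quick pre-check of a FASTA file against those shapes might look like the sketch below. The regular expression is only a reading of that message and may not match the check SQANTI2 performs internally; `invalid_ids` is a hypothetical helper:

```python
import re

# Assumed from the error text: PB.X.Y, PB.X.Y|anything, or PBfusion.X
ID_PATTERN = re.compile(r"^(PB\.\d+\.\d+(\|.*)?|PBfusion\.\d+)$")


def invalid_ids(fasta_lines):
    """Return sequence IDs (first word of each '>' header) that do not match ID_PATTERN."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            if not ID_PATTERN.match(seq_id):
                bad.append(seq_id)
    return bad
```

Any ID it returns would need renaming to one of the accepted shapes before the file is passed to sqanti_qc2.py.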
OK, I'm making progress in getting the latest version of SQANTI2 to run, but I've now hit the error below. I wonder if a change in Biopython causes the issue, or if there is some other version incompatibility. Have you perhaps encountered this issue before?
Traceback (most recent call last):
File "/sc/orga/projects/vanbah01a/opt/SQANTI2//sqanti_qc2.py", line 1539, in <module>
main()
File "/sc/orga/projects/vanbah01a/opt/SQANTI2//sqanti_qc2.py", line 1535, in main
run(args)
File "/sc/orga/projects/vanbah01a/opt/SQANTI2//sqanti_qc2.py", line 1181, in run
orfDict = correctionPlusORFpred(args, genome_dict)
File "/sc/orga/projects/vanbah01a/opt/SQANTI2//sqanti_qc2.py", line 363, in correctionPlusORFpred
err_correct(args.genome, corrSAM, corrFASTA, genome_dict=genome_dict)
File "/sc/orga/projects/vanbah01a/opt/cDNA_Cupcake/sequence/err_correct_w_genome.py", line 31, in err_correct
seq = sp.consistute_genome_seq_from_exons(genome_dict, r.sID, r.segments, r.flag.strand)
File "/sc/orga/projects/vanbah01a/opt/cDNA_Cupcake/sequence/coordinate_mapper.py", line 190, in consistute_genome_seq_from_exons
return seq.tostring()
AttributeError: 'Seq' object has no attribute 'tostring'
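Newer Biopython releases no longer provide `Seq.tostring()`; `str(seq)` is the replacement. A version-tolerant conversion could look like the sketch below. `seq_to_text` is a hypothetical helper, not part of cDNA_Cupcake, and it works on any object, so Biopython is not needed to try it:

```python
def seq_to_text(seq):
    """Return the sequence as a plain string with whichever API the object offers."""
    if hasattr(seq, "tostring"):
        # Older Biopython Seq objects still have tostring()
        return seq.tostring()
    # Current Biopython: str() is the supported conversion
    return str(seq)
```

Replacing the `seq.tostring()` call in coordinate_mapper.py with `str(seq)`, or installing a cDNA_Cupcake version that already does so, would be the corresponding change on the failing line.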
I just installed the latest version of SQANTI2, and had this error:
Traceback (most recent call last):
File "SQANTI2-master/sqanti_qc2.py", line 66, in <module>
v1, v2 = map(int, cupcake.version.split('.'))
AttributeError: 'module' object has no attribute 'version'
Hello Magdoll,
I ran sqanti_qc2.py according to your suggestion, as below:
sqanti_qc2.py -t 30 ../hq_isoforms.fasta.fusion.rep.fq ../gencode.v30.annotation.gtf ../GRCh38.p12.genome.fa --cage_peak ../hg38.cage_peak_phase1and2combined_coord.bed --coverage ../intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified --is_fusion
I replaced the collapsed FASTQ with the fusion FASTQ, which contains the PBfusion.x IDs, but I got the error below. Could you please let me know what went wrong? Thank you for your help!
R scripting front-end version 3.5.1 (2018-07-02)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/fusion/hq_isoforms.fasta.fusion.rep.renamed.fasta
[M::mm_idx_gen::91.598*1.86] collected minimizers
[M::mm_idx_gen::123.672*2.81] sorted minimizers
[M::main::123.673*2.81] loaded/built the index for 593 target sequence(s)
[M::mm_mapopt_update::128.297*2.74] mid_occ = 803
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 593
[M::mm_idx_stat::130.498*2.71] distinct minimizers: 167309999 (34.26% are singletons); average occurrences: 6.324; average spacing: 3.074
[M::worker_pipeline::131.405*2.74] mapped 43 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -ub -t 30 /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/GRCh38.p12.genome.fa /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/fusion/hq_isoforms.fasta.fusion.rep.renamed.fasta
[M::main] Real time: 131.678 sec; CPU: 360.729 sec; Peak RSS: 18.695 GB
output written to /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/squanti2-3/hq_isoforms.fasta.fusion.rep.renamed_corrected.fasta
error: , /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/squanti2-3/hq_isoforms.fasta.fusion.rep.renamed_corrected.fasta
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/GRCh38.p12.genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
Traceback (most recent call last):
File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1599, in <module>
main()
File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1595, in main
run(args)
File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1248, in run
orfDict = correctionPlusORFpred(args, genome_dict)
File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 480, in correctionPlusORFpred
if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
File "/opt/SQANTI2/20190618/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /opt/SQANTI2/20190618/bin/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/squanti2-3/GMST/GMST_tmp /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/squanti2-3/hq_isoforms.fasta.fusion.rep.renamed_corrected.fasta' returned non-zero exit status 1
Dear @Magdoll,
Thank you for providing a wonderful tool for long read data analysis.
I am trying to run SQANTI2 on cDNA nanopore data, and I wanted to ask you some questions about the results.
Jungwoo
Hi, Magdoll.
I follow the Iso-Seq3 -> Cupcake ToFU -> SQANTI pipeline, and I have some questions about the SQANTI results.
First, I noticed that the result in XX.collapsed.rep.renamed_corrected.gtf looks like
transcript_id "PB.1.1"; gene_id "PB.1.1";
instead of
transcript_id "PB.1.1"; gene_id "PB.1";
Second, I found no GTF output after running sqanti_filter2.py, although I can grep the filtered genes out of corrected.gtf to build a filtered GTF myself.
Thanks for your pipeline. It is very helpful!
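As a workaround sketch for the first question, the expected gene ID can be derived from the transcript ID by dropping the final dot-separated field, assuming the `PB.<gene>.<isoform>` naming shown in the post. The function names below are hypothetical, and this is not how SQANTI2 itself writes the GTF.

```python
import re

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
_GENE_ID = re.compile(r'gene_id "[^"]*"')


def gene_id_from_transcript_id(transcript_id):
    """'PB.1.1' -> 'PB.1'; IDs without a dot are returned unchanged."""
    prefix, sep, _ = transcript_id.rpartition(".")
    return prefix if sep else transcript_id


def fix_gene_id(attributes):
    """Rewrite gene_id in a GTF attribute string from its transcript_id."""
    match = _TRANSCRIPT_ID.search(attributes)
    if match is None:
        return attributes  # nothing to derive the gene ID from
    gene_id = gene_id_from_transcript_id(match.group(1))
    # A function replacement avoids any escape handling in the new text.
    return _GENE_ID.sub(lambda _: 'gene_id "%s"' % gene_id, attributes)
```

Applied to the ninth column of each GTF line, this leaves `transcript_id` untouched and only changes `gene_id`.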
Hi @Magdoll
I tried to analyze goat transcripts using SQANTI, but the output report contains no gene names corresponding to the transcripts. I also found that some novel isoforms have names of their own, such as 'novelGene_AS' in the 'antisense or genic' categories, while others do not. I have checked the reference GTF and the reference genome FASTA, and I don't think they have problems. I compared your example Input.fa with mine and found that the sequences in my Input.fa have no corresponding chromosomes; they aren't mapped. So I would like to know whether my Input.fa is wrong and what I should do next.
Question.pptx
Hello @Magdoll,
I used SQANTI2 results to update my old gene annotation. Is there a quick way to generate GFF3-format annotations from SQANTI2 results?
Best,
Xu
Hello,
I had an error (subprocess.CalledProcessError) when I ran SQANTI2 (v3.5.1). I have attached the log below. Could you please advise me on how to resolve it? Thank you!
Taehee
sqanti_qc2.py -t 30 collapse.collapsed.rep.fq gencode.v30.annotation.gtf GRCh38.p12.genome.fa --cage_peak hg38.cage_peak_phase1and2combined_coord.bed --coverage intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified --fl_count collapse.collapsed.abundance.txt
R scripting front-end version 3.5.1 (2018-07-02)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/collapse/collapse.collapsed.rep.renamed.fasta
[M::mm_idx_gen::83.946*1.76] collected minimizers
[M::mm_idx_gen::94.838*3.25] sorted minimizers
[M::main::94.839*3.25] loaded/built the index for 593 target sequence(s)
[M::mm_mapopt_update::100.084*3.13] mid_occ = 803
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 593
[M::mm_idx_stat::102.354*3.08] distinct minimizers: 167309999 (34.26% are singletons); average occurrences: 6.324; average spacing: 3.074
[M::worker_pipeline::115.089*5.66] mapped 23933 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 30 /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/GRCh38.p12.genome.fa /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/collapse/collapse.collapsed.rep.renamed.fasta
[M::main] Real time: 115.208 sec; CPU: 651.074 sec; Peak RSS: 23.933 GB
output written to /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/test/collapse.collapsed.rep.renamed_corrected.fasta
**** Parsing Isoforms....
Input pattern: /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/intropolis/intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified. The following files found and to be read as junctions:
/net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/intropolis/intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified
5041602 junctions read. 0 junctions added to both strands because no strand information from STAR.
**** RT-switching computation....
**** Reading Full-length read abundance files...
WARNING: PB.6189.1.gene not found in FL count file. Assign count as 0.
WARNING: PB.5162.3.gene not found in FL count file. Assign count as 0.
WARNING: PB.7490.1.gene not found in FL count file. Assign count as 0.
.
.
.
WARNING: PB.7367.10.gene not found in FL count file. Assign count as 0.
WARNING: PB.2750.7.gene not found in FL count file. Assign count as 0.
WARNING: PB.4315.1.gene not found in FL count file. Assign count as 0.
Isoforms expression files not provided.
**** Writing output files....
**** Generating SQANTI report....
Attaching package: ‘dplyr’
The following object is masked from ‘package:gridExtra’:
combine
The following object is masked from ‘package:reshape’:
rename
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Warning messages:
1: Removed 1 rows containing missing values (geom_text).
2: Removed 1 rows containing missing values (geom_bar).
3: Removed 1 rows containing missing values (geom_text).
Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
Execution halted
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/GRCh38.p12.genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
**** Parsing Reference Transcriptome....
**** Reading Splice Junctions coverage files.
**** Reading CAGE Peak data.
**** Performing Classification of Isoforms....
Number of classified isoforms: 47866
Traceback (most recent call last):
File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1717, in <module>
main()
File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1713, in main
run(args)
File "/opt/SQANTI2/20190618/bin/sqanti_qc2.py", line 1531, in run
if subprocess.check_call(cmd, shell=True)!=0:
File "/opt/SQANTI2/20190618/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/opt/SQANTI2/20190618/bin/Rscript /opt/SQANTI2/20190618/bin/utilities//SQANTI_report2.R /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/test/collapse.collapsed.rep_classification.txt /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6k/test/collapse.collapsed.rep_junctions.txt' returned non-zero exit status 1
Hi Liz,
I have been trying to run SQANTI2, but I have run into an error and am having a hard time identifying what could be causing it. This is the log output I get from the sqanti_qc2.py run:
R scripting front-end version 3.4.3 (2017-11-30)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: cupcake_processing/isoforms.polished.hq.collapsed.rep.renamed.fasta
Write arguments to isoforms.collapsed.sqanti2qc.params.txt...
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta data/hg38/GRCh38.p13_refseq_genomic.fna....
Error corrected FASTA isoforms.polished.hq.collapsed.rep.renamed_corrected.fasta already exists. Using it...
**** Predicting ORF sequences...
ORF file isoforms.polished.hq.collapsed.rep.renamed_corrected.faa already exists. Using it....
**** Parsing Reference Transcriptome....
refAnnotation_isoforms.collapsed.sqanti2qc.genePred already exists. Using it.
**** Parsing Isoforms....
Splice Junction Coverage files not provided.
**** Performing Classification of Isoforms....
Traceback (most recent call last):
File "SQANTI2/sqanti_qc2.py", line 1996, in <module>
main()
File "SQANTI2/sqanti_qc2.py", line 1991, in main
run(args)
File "SQANTI2/sqanti_qc2.py", line 1613, in run
isoforms_info = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict)
File "SQANTI2/sqanti_qc2.py", line 1459, in isoformClassification
orfDict[rec.id].cds_genomic_end = m[orfDict[rec.id].cds_end-1] + 1 # make it 1-based
KeyError: 1321
As you can see, everything works fine until the Isoform Classification step. Do you have any idea whether it could be a bug, or something related to the particular formatting that the ORF prediction outputs using my data? I'm saying this because it seems to be some sort of error in the creation of the dictionary storing the ORF information...
Thanks!
Ángeles
PhD Student, Conesa Lab
Hi , Liz:
I am running sqanti_qc2.py from version 3.3, and my command is:
python sqanti_qc2.py --aligner_choice gmap -x Gbarbadense -t 20 --geneid all.collapsed.rep.fa genome.gtf genome.fa -o test
In my STDERR, I get this error:
output written to all.collapsed.rep.renamed_corrected.fasta
**** Parsing Isoforms....
Traceback (most recent call last):
File "/export/pipeline/RNASeq/Software/SQANTI2/V3.3/SQANTI2-master/sqanti_qc2.py", line 1746, in <module>
main()
File "/export/pipeline/RNASeq/Software/SQANTI2/V3.3/SQANTI2-master/sqanti_qc2.py", line 1742, in main
run(args)
File "/export/pipeline/RNASeq/Software/SQANTI2/V3.3/SQANTI2-master/sqanti_qc2.py", line 1389, in run
write_collapsed_GFF_with_CDS(isoforms_info, corrGTF, corrGTF+'.cds.gff')
File "/export/pipeline/RNASeq/Software/SQANTI2/V3.3/SQANTI2-master/sqanti_qc2.py", line 387, in write_collapsed_GFF_with_CDS
for r in reader:
File "/export/pipeline/RNASeq/Software/cDNA_Cupcake/v8.0/cDNA_Cupcake/cupcake/io/GFF.py", line 393, in next
return self.read()
File "/export/pipeline/RNASeq/Software/cDNA_Cupcake/v8.0/cDNA_Cupcake/cupcake/io/GFF.py", line 550, in read
assert raw[2] == 'transcript'
AssertionError
Also, in one of the output files, "all.collapsed.rep.renamed_corrected.gtf", the third column contains only "exon" and never "transcript".
Could you please look into this error? Thanks a lot.
Feng
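To confirm an observation like the one above (only "exon" records in the third column), the feature types in a GTF can be tallied with a few lines of Python. This is a minimal diagnostic sketch, not part of SQANTI2 or cDNA_Cupcake, and `count_gtf_features` is a hypothetical helper name.

```python
from collections import Counter


def count_gtf_features(lines):
    """Tally column 3 (feature type) over an iterable of GTF lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

For a file on disk, something like `count_gtf_features(open("all.collapsed.rep.renamed_corrected.gtf"))` would show whether any "transcript" records exist before the GFF reader's assertion is reached.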
Hi there!
I just got an error from: python sqanti_qc2.py -t 30
Here is what the script returns:
Traceback (most recent call last):
File "sqanti_qc2.py", line 117, in <module>
if os.system(RSCRIPTPATH + " --version")!=0:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
I have already checked that I meet all the prerequisites. Many thanks in advance.
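The TypeError above indicates that `RSCRIPTPATH` was `None` when it was concatenated with a string, which typically happens when an executable lookup finds nothing on `PATH`. The sketch below shows one way to guard against that, assuming Python 3 and `shutil.which`; the function names are hypothetical and this is not the code SQANTI2 uses.

```python
import shutil


def find_rscript():
    """Return the full path to Rscript, or None if it is not on PATH."""
    return shutil.which("Rscript")


def rscript_version_command(rscript_path):
    """Build the version-check command, failing clearly if the path is None."""
    if rscript_path is None:
        raise RuntimeError(
            "Rscript executable not found; check that R is installed and on PATH"
        )
    return rscript_path + " --version"
```

Calling `rscript_version_command(find_rscript())` then either yields a usable command or raises an error that names the real problem instead of a TypeError.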
Hi, Liz:
Thanks for your reply on issue #23; it solved my problem. But I ran into another error when running version 3.5. My command was:
python sqanti_qc2.py --aligner_choice minimap2 -t 20 --geneid all.collapsed.rep.fa genome.gtf genome.fa -o test
The error message was:
output written to all.collapsed.rep.renamed_corrected.fasta
Skipping PB.9452.1 because unmapped.
**** Parsing Isoforms....
Traceback (most recent call last):
File "/export/pipeline/RNASeq/Software/SQANTI2/v3.5/SQANTI2-master/sqanti_qc2.py", line 1753, in <module>
main()
File "/export/pipeline/RNASeq/Software/SQANTI2/v3.5/SQANTI2-master/sqanti_qc2.py", line 1749, in main
run(args)
File "/export/pipeline/RNASeq/Software/SQANTI2/v3.5/SQANTI2-master/sqanti_qc2.py", line 1392, in run
isoforms_info = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict)
File "/export/pipeline/RNASeq/Software/SQANTI2/v3.5/SQANTI2-master/sqanti_qc2.py", line 1278, in isoformClassification
orfDict[rec.id].cds_genomic_end = m[orfDict[rec.id].cds_end-1] + 1 # make it 1-based
KeyError: 946
But when I changed the input isoforms from FASTA format to GTF format, the run completed successfully. The command was:
python sqanti_qc2.py -g --geneid all.collapsed.gff genome.gtf genome.fa -o test
I wonder whether this was caused by the code or by the input isoform FASTA file. Could you give any advice?
Thanks.
I have a question about using RSEM to quantify Iso-Seq collapsed transcripts with RNA-seq short reads, and then using the result as the short-read expression file for SQANTI2. Sorry if this is not the right place to discuss it.
The SQANTI paper states that it can take Iso-Seq + RNA-seq as input. My understanding is that the Iso-Seq collapsed transcripts serve as the reference, and the short reads are mapped against them using RSEM (or other tools). RSEM requires reference sequences to be built with the rsem-prepare-reference command. I tried to run this command with the collapsed transcripts as the reference, but it didn't work, and I can't tell which of the options in the RSEM reference preparation step should be used. If anyone can provide the commands or options, that would be great.
From the Iso-Seq pipeline I have GMAP-aligned files and ToFU collapsed transcripts, and for RNA-seq I have paired-end reads.
Cheers!
Hello Magdoll,
I have updated to the latest version of SQANTI2, but I still can't use '--is_fusion'. I downloaded it from https://github.com/Magdoll/SQANTI2. Did I download the correct one? Thank you!
sqanti_qc2.py -t 30 ../collapse.collapsed.rep.fq ../gencode.v30.annotation.gtf ../GRCh38.p12.genome.fa --cage_peak ../hg38.cage_peak_phase1and2combined_coord.bed --coverage ../intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified --fl_count ../abundance/collapse.collapsed.abundance.txt --is_fusion ../hq_isoforms.fasta.fusion.rep.fq
R scripting front-end version 3.5.1 (2018-07-02)
usage: sqanti_qc2.py [-h] [--aligner_choice {minimap2,deSALT,gmap}]
[--cage_peak CAGE_PEAK]
[--polyA_motif_list POLYA_MOTIF_LIST]
[--phyloP_bed PHYLOP_BED] [--skipORF] [--is_fusion] [-g]
[-e EXPRESSION] [-x GMAP_INDEX] [-t GMAP_THREADS] [-z]
[-o OUTPUT] [-d DIR] [-c COVERAGE] [-s SITES] [-w WINDOW]
[--geneid] [-fl FL_COUNT] [-v]
isoforms annotation genome
sqanti_qc2.py: error: unrecognized arguments: /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/fusion/hq_isoforms.fasta.fusion.rep.fq
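The usage message above lists `[--is_fusion]` with no value placeholder, which suggests it is a boolean switch that takes no argument, so a file path placed after it is left over as an unrecognized argument. The following sketch reproduces that behavior with a stripped-down parser. It mirrors only the three positionals and the one flag from the usage text, and the assumption that the flag is defined with `action="store_true"` is mine.

```python
import argparse


def build_parser():
    """Minimal stand-in parser: three positionals plus a boolean flag."""
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("isoforms")
    parser.add_argument("annotation")
    parser.add_argument("genome")
    # Assumed definition: a switch that consumes no value.
    parser.add_argument("--is_fusion", action="store_true")
    return parser


def leftover_arguments(argv):
    """Return (namespace, extras); extras are what parse_args would reject."""
    return build_parser().parse_known_args(argv)
```

For example, `leftover_arguments(["iso.fq", "ann.gtf", "genome.fa", "--is_fusion", "fusion.fq"])` sets `is_fusion` to true and reports `fusion.fq` as an extra, which is the situation `parse_args` turns into the "unrecognized arguments" error.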
Hi @Magdoll,
I was running sqanti2 for the first time and got an error message saying:
**** Parsing Reference Transcriptome....
.../SQANTI2/utilities/gtfToGenePred: error while loading shared libraries: libpng12.so.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "SQANTI2/sqanti_qc2.py", line 1395, in <module>
main()
File "SQANTI2/sqanti_qc2.py", line 1391, in main
run(args)
File "SQANTI2/sqanti_qc2.py", line 1084, in run
refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene = reference_parser(args, genome_dict.keys())
File "SQANTI2/sqanti_qc2.py", line 480, in reference_parser
for r in genePredReader(referenceFiles):
File "SQANTI2/sqanti_qc2.py", line 90, in __init__
self.f = open(filename)
IOError: [Errno 2] No such file or directory: '.../refAnnotation_hq_transcripts.genePred'
(sqanti) lena@pgm-Precision-WorkStation-T7500: ... /SQANTI2/utilities/gtfToGenePred: error while loading shared libraries: libpng12.so.0: cannot open shared object file: No such file or directory
-bash: /SQANTI2/utilities/gtfToGenePred:: No such file or directory
Do I need to provide a reference transcriptome file here?
Thanks for your help!
Lena
Hi @Magdoll ,
I created the environment and installed all the packages as per the tutorial.
I ran the following command:
$ python sqanti_qc2.py --aligner_choice=minimap2 ~/pacbio/testdir/mapped.fa Homo_sapiens.GRCh38.97.gtf HangleeGCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai
and got this message:
Unable to import err_correct_w_genome or sam_to_gff3.py! Please make sure cDNA_Cupcake/sequence/ is in $PYTHONPATH.
sys.path shows that cDNA_Cupcake/sequence/ is in PYTHONPATH.
My Cupcake version:
Last Updated: 06/25/2019
Current version: 7.9
Thanks
Hi Liz,
Thank you for your help last time. Could I ask for your help once more?
I have tested SQANTI2, and it seems that when a read ("read1") is split and mapped to different sites, SQANTI2 renames the records as "read1" and "read1_dup2".
However, this renaming seems to affect only the GTF and classification output files, not the FASTA and SAM files (I can find "_dup2" and "_dup3" in ".renamed_corrected.gtf", but not in ".renamed_corrected.fasta" or ".renamed_corrected.sam"). As a result, the FASTA and SAM files contain duplicated read names, which causes trouble in many downstream analyses, because many programs identify reads/isoforms by read name alone.
Is this what SQANTI2 intended? If not, is there an easy way to fix this issue? Any advice would be really helpful!
Thanks,
Yeji
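As a possible workaround for the duplicated names described above, FASTA headers can be made unique by appending `_dup2`, `_dup3`, and so on to repeated IDs. This is a minimal sketch that assumes the duplicates appear in the FASTA in the same order in which the GTF numbers them, which the post does not confirm. `rename_duplicate_ids` is a hypothetical helper and not SQANTI2's own logic.

```python
from collections import Counter


def rename_duplicate_ids(fasta_lines):
    """Yield FASTA lines with repeated record IDs suffixed _dup2, _dup3, ..."""
    seen = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            body = line[1:].rstrip("\n")
            # The ID is the text up to the first space; the rest is description.
            name, _, rest = body.partition(" ")
            seen[name] += 1
            if seen[name] > 1:
                name = "%s_dup%d" % (name, seen[name])
            line = ">" + name + (" " + rest if rest else "") + "\n"
        yield line
```

The first occurrence keeps its original name, matching the "read1" / "read1_dup2" pattern reported for the GTF.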
When I use --is_fusion, I get this error.
The FASTA file is from fusion_finder.py.
python sqanti_qc2.py saf.fusion.rep.fa saf.gff3 genome.fasta --aligner_choice=minimap2 --is_fusion
If I use --is_fusion, ORF prediction is not run, but maybe the next step still uses the *.genePred file?
R scripting front-end version 3.5.1 (2018-07-02)
[M::mm_idx_gen::44.284*0.97] collected minimizers
[M::mm_idx_gen::75.330*0.98] sorted minimizers
[M::main::75.345*0.98] loaded/built the index for 12 target sequence(s)
[M::mm_mapopt_update::77.569*0.98] mid_occ = 905
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 12
[M::mm_idx_stat::79.243*0.98] distinct minimizers: 79003367 (55.73% are singletons); average occurrences: 4.578; average spacing: 2.923
[M::worker_pipeline::81.420*0.98] mapped 99 sequences
[M::main] Version: 2.17-r943-dirty
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 1 /public/home/genome/genome.fasta /public/home/saf.fusion.rep.renamed.fasta
[M::main] Real time: 81.481 sec; CPU: 79.892 sec; Peak RSS: 7.007 GB
/public/home/saf.gff3 doesn't appear to be a GTF file (GFF not supported by this program)
WARNING: Currently if --is_fusion is used, no ORFs will be predicted.
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /public/home/saf.fusion.rep.renamed.fasta
output written to
WARNING: Skipping ORF prediction because user requested it. All isoforms will be non-coding!
WARNING: All input isoforms were predicted as non-coding
Traceback (most recent call last):
File "/public/home/software/SQANTI2-master/sqanti_qc2.py", line 1794, in <module>
main()
File "/public/home/software/SQANTI2-master/sqanti_qc2.py", line 1790, in main
run(args)
File "/public/home/software/SQANTI2-master/sqanti_qc2.py", line 1418, in run
refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene = reference_parser(args, list(genome_dict.keys()))
File "/public/home/software/SQANTI2-master/sqanti_qc2.py", line 625, in reference_parser
for r in genePredReader(referenceFiles):
File "/public/home/zcyu/software/SQANTI2-master/sqanti_qc2.py", line 124, in __init__
self.f = open(filename)
FileNotFoundError: [Errno 2] No such file or directory: '/public/home/code/refAnnotation_saf.fusion.rep.genePred'
Hi Elizabeth,
When using sqanti_qc2.py, if the filename of the input fasta file is sufficiently long (and possibly including the path too), GeneMark will throw an error.
Error on open output file /mnt/data/IsoSeq_20190729/GMST/GMST_tmp
GeneMarkS: error on last system call, error code 256
Abort program!!!
Examination of the GeneMark log file suggests that gmst.pl is happy, but probuild is not:
/mnt/data/IsoSeq_20190729/SQANTI2/utilities/gmst/probuild --par /mnt/data/IsoSeq_20190729/SQANTI2/utilities/gmst/par_1.default --clean_join sequence --seq /mnt/data/IsoSeq_20190729/m54178_190723_190240.ccs.demux.primer_5p--primer_3p_flnc_polished_CUPCAKE.collapsed.rep.renamed_corrected.fasta --log gms.log
Error on last system call, error code 256
Abort program!!!
Direct invocation of probuild yields the following uninformative error:
error in command line
This behavior is clearly a GeneMark issue rather than a SQANTI2 one. Nevertheless, keeping input filenames as short as possible avoids the problem.
Sincerely,
Mark
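One way to apply the "keep filenames short" workaround without renaming the original data is to run on a short-named symbolic link. The sketch below assumes a POSIX filesystem and that the length limit concerns the name passed on the command line, which the post leaves open ("possibly including the path too"). `short_alias` and the default name `input.fasta` are hypothetical.

```python
import os
import tempfile


def short_alias(long_path, short_name="input.fasta", workdir=None):
    """Create a short-named symlink to long_path and return the link path."""
    if workdir is None:
        # A fresh temporary directory avoids clobbering an existing file.
        workdir = tempfile.mkdtemp(prefix="sq_")
    alias = os.path.join(workdir, short_name)
    os.symlink(os.path.abspath(long_path), alias)
    return alias
```

The returned path can then be given to sqanti_qc2.py in place of the long filename. Note that output files would be named after the alias rather than the original file.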
Hi @Magdoll,
Would it be possible to download somewhere a "SQANTI2 ready-to-use" file (in STAR junction format) corresponding to Intropolis for the hg38 and hg19 genomes? Or, at least, could you include a script that performs the conversion to STAR junction format?
It seems that none of the files downloadable from the Intropolis GitHub repository are in the right format for SQANTI2, or am I wrong?
Thank you very much in advance for your answer.
Best regards
Hello Magdoll,
R scripting front-end version 3.5.1 (2018-07-02)
usage: sqanti_qc2.py [-h] [--aligner_choice {minimap2,deSALT,gmap}]
[--cage_peak CAGE_PEAK]
[--polyA_motif_list POLYA_MOTIF_LIST]
[--phyloP_bed PHYLOP_BED] [--skipORF] [-g]
[-e EXPRESSION] [-x GMAP_INDEX] [-t GMAP_THREADS] [-z]
[-o OUTPUT] [-d DIR] [-c COVERAGE] [-s SITES] [-w WINDOW]
[--geneid] [-fl FL_COUNT] [-v]
isoforms annotation genome
sqanti_qc2.py: error: unrecognized arguments: --is_fusion /net/isi-dcnl/ifs/user_data/Seq/PacBio/thkang/lili_wang_iso-seq/nalm6e/fusion/hq_isoforms.fasta.fusion.abundance.txt
Hi @Magdoll
I was trying to run sqanti_qc2.py
and came across the following error
File "sqanti_qc2.py", line 43, in <module>
from sam_to_gff3 import convert_sam_to_gff3
File "/u/home/a/Resource/cDNA_Cupcake/sequence/sam_to_gff3.py", line 85
parser.add_argument("-s", "--source", required=True, help="source name (ex: hg38, mm10)")
^
IndentationError: unexpected indent
cDNA_Cupcake/sequence is in my PATH. How can I resolve it?
I'm running the entire IsoSeq pipeline on some PacBio CCS reads (mRNA, average CCS length ~2kb). For now I'm running on a test dataset of 1000 reads. My pipeline has all the recommended steps (lima, isoseq3 refine, minimap2, etc). Of these, the sqanti2_qc step is the slowest by far and seems to take 50x longer than all other steps put together. Particularly, I'm giving it 8 cores (-t 8) but it only ever seems to use 1 core. Is this normal or is there a parallelization option I'm missing?
You can see my pipeline at https://github.com/gmstanle/nf-core-scisoseq.
Hello Magdoll,
My collapsed header after Iso-Seq3 and minimap2 is slightly different from your example. Is it OK to run SQANTI with it? Thank you!
PB.1.1|chr1:184917-199875(-)|transcript/19866 transcript/19866 full_length_coverage=28;length=1754;num_subreads=60
PB.2.1|chr1:827670-843589(+)|transcript/11029 transcript/11029 full_length_coverage=2;length=2604;num_subreads=54
PB.3.1|chr1:944203-959277(-)|transcript/15184 transcript/15184 full_length_coverage=2;length=2218;num_subreads=22
.
.
.
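To make the header layout above explicit, the sketch below splits one such header into its parts with a regular expression. It is written only against the three example lines in the post and says nothing about which header formats SQANTI2 accepts. The pattern and the name `parse_collapsed_header` are assumptions for illustration.

```python
import re

# PB.<gene>.<isoform>|<chrom>:<start>-<end>(<strand>)|<source id> ...
_HEADER = re.compile(
    r"^(?P<pbid>PB\.\d+\.\d+)\|"
    r"(?P<chrom>[^:|]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)\|"
    r"(?P<source>\S+)"
)


def parse_collapsed_header(header):
    """Return a dict of header fields, or None if the layout does not match."""
    match = _HEADER.match(header.lstrip(">"))
    if match is None:
        return None
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    return fields
```

The `pbid` field is the part before the first `|`, which is the portion that follows the `PB.X.Y` convention used elsewhere in these threads.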