Comments (5)
Hi @Mel66 ,
(1) Sorry, my mind is blanking on the _dup suffix, and I cannot find where in my code I would have added it. Can you be more specific? Where did the _dup suffix show up? In which files was it absent, and in which files did it appear?
(2) Only FSM_class should be ignored for now, since I have not implemented it yet. The other fields (28-31), which cover ORF and CDS prediction, are valid and can be used, unless you ran with the --skipORF option, in which case they will all be zero.
--Liz
from sqanti2.
The suffix _dup appears in classification.txt, junctions.txt, .renamed_corrected.genePred and .renamed_corrected.gtf.
It doesn't appear in renamed_corrected.fasta, but this file contains additional sequences. These added sequences are named the same as some of the original sequences. For example, renamed_corrected.fasta contains 4 sequences named transcript/4922, while the original fasta and renamed.fasta only have 1 sequence with this name. The four previously mentioned files contain transcript/4922_dup2, transcript/4922_dup3, and transcript/4922_dup4. According to classification.txt, these 3 transcripts are encoded by different genes.
Thank you for your help.
Melissa
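A quick way to confirm the repeated names described above is to count the sequence IDs in the fasta. This is only a minimal sketch: the function names are made up for illustration, the ID is assumed to be the first word of each ">" header, and the toy lines stand in for renamed_corrected.fasta.

```python
from collections import Counter


def count_fasta_ids(lines):
    """Count how often each sequence ID occurs in an iterable of fasta lines.

    Assumption: the ID is the first whitespace-delimited word after '>'.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                counts[fields[0]] += 1
    return counts


def duplicated_ids(lines):
    """Return only the IDs that occur more than once, with their counts."""
    return {seq_id: n for seq_id, n in count_fasta_ids(lines).items() if n > 1}


# Toy input; with a real file, pass an open file handle instead of a list.
toy_fasta = [
    ">transcript/4922\n", "ACGT\n",
    ">transcript/4922\n", "ACGG\n",
    ">transcript/10\n", "TTTT\n",
]
print(duplicated_ids(toy_fasta))
```

With a real file, `with open("renamed_corrected.fasta") as fh: print(duplicated_ids(fh))` would list every name that occurs more than once.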
Hi @Mel66 ,
I'm sorry for the late response as I was on vacation last week.
If renamed_corrected.fasta already had 4 duplicate sequences, my guess is that when the sequence was mapped back to the genome (for the genome-based error correction, which is SQANTI's first step), it mapped to multiple places. This is entirely possible when different aligners (or even different versions of the same aligner) are used. For example, GMAP might decide to map a sequence continuously while minimap2 might decide to break it up.
Are you willing to share the input data confidentially? Is this on human? To reproduce your results, I would need the input fasta, the ref genome, and the ref annotation. If you can share, please give me your email so I can request a file upload.
--Liz
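For illustration, here is how a multi-mapped read could end up with the names reported in this thread (transcript/4922, transcript/4922_dup2, and so on). This is a hypothetical sketch inferred from those names only, not SQANTI2's actual code; `label_multimapped` is a made-up helper, and the input is assumed to be one query ID per alignment record.

```python
from collections import defaultdict


def label_multimapped(query_ids):
    """Give the 2nd, 3rd, ... alignment of the same query a _dupN suffix.

    Hypothetical naming scheme: the first hit keeps its ID, later hits
    become <id>_dup2, <id>_dup3, and so on.
    """
    seen = defaultdict(int)
    labelled = []
    for qid in query_ids:
        seen[qid] += 1
        n = seen[qid]
        labelled.append(qid if n == 1 else f"{qid}_dup{n}")
    return labelled


# One query ID per alignment record, in file order.
hits = ["transcript/4922"] * 4 + ["transcript/10"]
for name in label_multimapped(hits):
    print(name)
```

A sequence that maps to four loci would then appear once under its own name and three more times with a suffix, which matches what the classification and junction files showed.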
Hi Magdoll,
Sorry for the late response; I didn't see that you had already answered. You were right: some of the sequences mapped to multiple genes on the genome, which caused the dups. I worked around this problem by using a gtf file for classification instead of the fasta file.
sqanti_qc2.py didn't seem to work with gtf files because it has some bugs. I made some changes for myself, and my results seem fine now. I added the errors I got below; maybe they can be of use to you.
Melissa.
I got this error while running SQANTI with a gtf (-g). I "fixed" this by adding "if not args.gtf:" to line 1377.
R scripting front-end version 3.5.0 (2018-04-23)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /home/mvantilborg/hq.renamed.fasta
Traceback (most recent call last):
File "SQANTI2/sqanti_qc2.py", line 1395, in <module>
main()
File "SQANTI2/sqanti_qc2.py", line 1384, in main
if args.aligner_choice == "minimap2":
AttributeError: 'Namespace' object has no attribute 'aligner_choice'
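The guard described above can be sketched as follows. This is not the actual sqanti_qc2.py code: `choose_aligner` is a hypothetical helper, and the `Namespace` objects only mimic the attribute names that appear in the traceback and in the fix (`gtf`, `aligner_choice`).

```python
import argparse


def choose_aligner(args):
    """Return the aligner name, or None when alignment should be skipped.

    Mirrors the "if not args.gtf:" workaround: with a gtf as input there is
    nothing to align, so aligner_choice is never looked up. getattr() with a
    default also avoids AttributeError if the attribute was never defined.
    """
    if getattr(args, "gtf", False):
        return None
    return getattr(args, "aligner_choice", None)


# A namespace like the one in the traceback: gtf set, no aligner_choice at all.
gtf_run = argparse.Namespace(gtf=True)
print(choose_aligner(gtf_run))

# A fasta run where the aligner option is present.
fasta_run = argparse.Namespace(gtf=False, aligner_choice="minimap2")
print(choose_aligner(fasta_run))
```

Checking the gtf flag first keeps the alignment branch from running at all, which is the behaviour the one-line fix achieves.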
Then I got the error below. I worked around it by commenting out line 1369, because it looked like the function rename_isoform_seqids was trying to read my gtf like it was a fasta or fastq file, and the resulting renamed fasta file (hq.renamed.fasta) was empty.
R scripting front-end version 3.5.0 (2018-04-23)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /home/mvantilborg/hq.renamed.fasta
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /usr/local/Genomes/species/H.sapiens/GRCh38_no_alt_analysis_set/reference.fa....
Skipping aligning of sequences because gtf file was provided.
ERROR: gtf has not annotation lines.
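One way to avoid treating a gtf as a fasta or fastq file is to look at the first data line before renaming anything. The sketch below is an assumption about how such a check could look, not part of SQANTI2; `sniff_format` is a hypothetical helper, and it relies only on the general conventions that fasta headers start with ">", fastq headers start with "@", and gtf records have nine tab-separated columns.

```python
def sniff_format(lines):
    """Guess the format of a transcript file from its first data line.

    Returns "fasta", "fastq", "gtf", "unknown", or "empty".
    Blank lines and '#' comment lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        if len(line.split("\t")) == 9:
            return "gtf"
        return "unknown"
    return "empty"


# Toy records; with a real file, pass an open file handle instead.
fasta_lines = [">PB.1.1\n", "ACGT\n"]
gtf_lines = ['chr1\tPacBio\texon\t1\t100\t.\t+\t.\tgene_id "PB.1"; transcript_id "PB.1.1";\n']
print(sniff_format(fasta_lines), sniff_format(gtf_lines))
```

A renaming step could then be skipped whenever the result is "gtf", instead of producing an empty renamed fasta.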
Then I got the error below. I worked around it by using the --skipORF option, because I was doing an ORF prediction myself. But it was probably caused by skipping the renaming.
R scripting front-end version 3.5.0 (2018-04-23)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /home/mvantilborg/hq.gff
**** Running SQANTI...
**** Parsing provided files....
Reading genome fasta /usr/local/Genomes/species/H.sapiens/GRCh38_no_alt_analysis_set/reference.fa....
Skipping aligning of sequences because gtf file was provided.
Indels will be not calculated since you ran SQANTI without alignment step (SQANTI with gtf format as transcriptome input).
**** Predicting ORF sequences...
Expected GMST output IDs to be of format ' gene_4|GeneMark.hmm|_aa||<cds_start>|<cds_end>' but instead saw: PB.1.1 gene=PB.1 gene_1|GeneMark.hmm|138_aa|+|3|419! Abort!
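The ID in the message above has extra words between the sequence ID and the pipe-delimited GMST fields. A tolerant parser could take the first word as the ID and the last word as the GMST part. This is a sketch under assumptions, not the SQANTI2 parser: `parse_gmst_header` is a made-up name, the six-field layout is inferred from the ID shown in the error, and reading 138_aa as the ORF length in amino acids is a guess.

```python
def parse_gmst_header(header):
    """Split a GMST-style header into its parts.

    Assumed layout (inferred from the error message):
      <seq_id> [extra words] <gene>|GeneMark.hmm|<N>_aa|<strand>|<cds_start>|<cds_end>
    """
    tokens = header.split()
    if len(tokens) < 2:
        raise ValueError(f"Unexpected GMST header: {header!r}")
    fields = tokens[-1].split("|")
    if len(fields) != 6:
        raise ValueError(f"Unexpected GMST header: {header!r}")
    gene, source, orf, strand, cds_start, cds_end = fields
    return {
        "seq_id": tokens[0],
        "gene": gene,
        "source": source,
        "orf_length_aa": int(orf.split("_")[0]),  # assumption: "138_aa" means 138 aa
        "strand": strand,
        "cds_start": int(cds_start),
        "cds_end": int(cds_end),
    }


record = parse_gmst_header("PB.1.1 gene=PB.1 gene_1|GeneMark.hmm|138_aa|+|3|419")
print(record["seq_id"], record["cds_start"], record["cds_end"])
```

Because only the first and last words are used, the extra "gene=PB.1" word that the gff-derived names carry no longer breaks the split.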
Close unless further notice.