Comments (7)
Hi @markopetek ,
Your count file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt is somehow incomplete. It is missing, for example, the count information for PB.1.1, which is present in other collapse output such as all_sizes.quivered_hq_MPE_S2.fastq.collapsed.GFF.
The file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.group.txt shows that you have 9703 isoforms. This means there should be 9703 + 15 (for the header) = 9718 lines in all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt, but it has only 4339.
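The line-count check above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical and not part of cDNA_cupcake, and it assumes one isoform per non-empty line in group.txt and a 15-line header in the abundance file, as stated above.

```python
def count_isoforms(group_lines):
    """Count isoforms, assuming one non-empty line per isoform in group.txt."""
    return sum(1 for line in group_lines if line.strip())


def expected_abundance_lines(n_isoforms, n_header=15):
    """Expected line count of the abundance file: isoforms plus header lines."""
    return n_isoforms + n_header


def abundance_is_complete(group_lines, abundance_lines, n_header=15):
    """True if the abundance file has as many lines as the group file implies."""
    actual = sum(1 for _ in abundance_lines)
    return actual == expected_abundance_lines(count_isoforms(group_lines), n_header)
```

With real files you would pass open file handles, for example `abundance_is_complete(open(group_path), open(abundance_path))`, where both paths are whatever your collapse run produced.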
--Liz
from cdna_cupcake.
Hi Liz,
I thought this incompleteness of the abundance file might be caused by some glitch in the SAM file produced by GMAP, so I mapped the reads using STARlong, sorted them, and reran the collapse and filter scripts, but I get the same error. Would you suggest any parameter change when running the collapse script in order to avoid this outcome? Maybe -c 0.99 -i 0.95 is too strict, since I've mapped cultivated potato transcripts to a reference from a different subspecies. That might be why the GFF has more isoforms than the abundance file.
from cdna_cupcake.
Hi @markopetek ,
Can you email me the input FASTQ file? I need it to run collapse so I can see if it's a bug in collapse. Whether you use GMAP or STAR or change parameters should have no effect.
from cdna_cupcake.
Hi @markopetek ,
I believe I have identified the issue. Somehow some of the IDs in your FASTQ file had an extra underscore that was not there in the cluster_report.csv (the cluster report is what is read to generate counts).
The group txt looked like this:
PB.1.1 i0_HQ__S2|c22385/f3p0/1042
PB.2.1 i0_HQ__S2|c218058/f3p12/1377
PB.2.2 i0_HQ__S2|c113604/f3p17/1237
PB.2.3 i0_HQ__S2|c356155/f2p6/825
PB.3.1 i0_HQ__S2|c212424/f2p1/1373
PB.3.2 i0_HQ__S2|c4103/f3p1/1275
PB.4.1 i1_HQ_S2|c114684/f2p18/3644
PB.4.2 i1_HQ_S2|c14710/f2p15/2321
PB.5.1 i1_HQ_S2|c161525/f14p22/3663
where you can see that some of the IDs had two underscores (__) instead of one before S2. This was consistent with the HQ FASTQ IDs but inconsistent with the cluster_report.csv, in which all IDs had only a single underscore:
cluster_id,read_id,read_type
i3_ICE_S2|c0,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/22583/37_16585_CCS,FL
i3_ICE_S2|c1,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/34221/37_14492_CCS,FL
i3_ICE_S2|c2,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/127040/12966_55_CCS,FL
(Don't be alarmed to see _ICE_ instead of _HQ_ in cluster_report.csv; it's supposed to be that way because at the end of ICE, we don't know whether clusters become HQ or LQ. The script get_abundance_post_collapse.py understands this difference.)
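For anyone who wants to check their own files for this kind of mismatch, here is a rough diagnostic sketch. It is not the logic of get_abundance_post_collapse.py. The helper names are hypothetical, and it assumes that a group member such as i0_HQ_S2|c22385/f3p0/1042 corresponds to the cluster_id i0_ICE_S2|c22385, and that group.txt lines are a PB ID followed by whitespace and a comma-separated member list.

```python
import csv
import io
import re

HQ_OR_LQ = re.compile(r"_(HQ|LQ)_")


def group_member_to_cluster_id(member):
    """Map 'i0_HQ_S2|c22385/f3p0/1042' to 'i0_ICE_S2|c22385' (assumed mapping)."""
    prefix = member.split("/", 1)[0]          # drop '/f3p0/1042'
    return HQ_OR_LQ.sub("_ICE_", prefix, count=1)


def unmatched_members(group_lines, cluster_report_text):
    """Return (pbid, member) pairs whose cluster ID is absent from the report."""
    reader = csv.DictReader(io.StringIO(cluster_report_text))
    known = {row["cluster_id"] for row in reader}
    missing = []
    for line in group_lines:
        if not line.strip():
            continue
        pbid, members = line.split(None, 1)
        for member in members.strip().split(","):
            if group_member_to_cluster_id(member) not in known:
                missing.append((pbid, member))
    return missing
```

An ID with a doubled underscore maps to something like i0_ICE__S2|c22385, which will not be found in a report that uses single underscores, so it shows up in the returned list.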
Once I changed the group file to remove the extra underscore and ran get_abundance_post_collapse.py again, I got the right results, and subsequently filter_away_subset.py worked as well.
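The group file edit can be done by hand or with a small script. The following is a minimal sketch, not the exact edit I made; fix_group_line is a hypothetical helper that collapses any run of underscores after the HQ or LQ tag into a single one.

```python
import re

# Two or more underscores right after the HQ/LQ tag, e.g. "_HQ__S2".
EXTRA_UNDERSCORE = re.compile(r"_(HQ|LQ)_{2,}")


def fix_group_line(line):
    """Rewrite IDs like 'i0_HQ__S2|c22385/...' to 'i0_HQ_S2|c22385/...'."""
    return EXTRA_UNDERSCORE.sub(r"_\1_", line)
```

Lines whose IDs already have a single underscore pass through unchanged, so it is safe to apply to every line of group.txt.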
I've put the fixed group file and filtered results here:
https://www.dropbox.com/s/ec7uhd3sbllenkb/fixed.tar.gz?dl=0
from cdna_cupcake.
Thanks for the help, Liz. I see that in the FASTQ file the sequences starting with @i0_HQ are followed by two underscores, while those starting with @i1_HQ or @i2_HQ have only one. I'll rerun the GMAP mapping in order to include the chloroplast and mitochondrial genomes in addition to the nuclear genome. I'll fix the input FASTQ before running the collapse script by replacing all @i0_HQ__S2 with @i0_HQ_S2. Is that OK, or should I rather fix the group.txt file as you did?
from cdna_cupcake.
Fixing the HQ FASTQ file header itself is easier because that's the "root" of the problem. Fixing the group.txt is also fine, you just have to remember to do it every time you re-run collapse.
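If you go the FASTQ route, something along these lines should do it. This is a sketch that assumes a standard four-line-per-record FASTQ (no wrapped sequences); fix_fastq_headers is a hypothetical helper, not a cDNA_cupcake script. It only touches header lines, so a quality string that happens to start with @ is left alone.

```python
import re

# Header prefix such as "@i0_HQ" followed by two or more underscores.
DOUBLE_UNDERSCORE = re.compile(r"^(@i\d+_(?:HQ|LQ))_{2,}")


def fix_fastq_headers(lines):
    """Yield FASTQ lines with headers like '@i0_HQ__S2' rewritten to '@i0_HQ_S2'."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each 4-line record
            line = DOUBLE_UNDERSCORE.sub(r"\1_", line)
        yield line
```

To apply it, read the original HQ FASTQ line by line and write the yielded lines to a new file, then run collapse on the new file.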
What's not clear to me is why there are two underscores for @i0_HQ. I've never seen it happen. Hopefully it's just a human error somewhere.
--Liz
from cdna_cupcake.