For S1 sample the filter_away_subset.py finished without a problem but for the second

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

filter_away_subset.py error about cdna_cupcake HOT 7 CLOSED

magdoll commented on June 22, 2024

filter_away_subset.py error

from cdna_cupcake.

Comments (7)

Magdoll commented on June 22, 2024

Hi @markopetek ,

Your count file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt is somehow incomplete. It is missing, for example, the count information for PB.1.1, which is present in other collapse output such as all_sizes.quivered_hq_MPE_S2.fastq.collapsed.GFF.

The file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.group.txt shows that you have 9703 isoforms, this means there should be 9703+15 (for header)=9718 lines in all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt but it has only 4339.

--Liz

from cdna_cupcake.

markopetek commented on June 22, 2024

Hi Liz,

I thought this incompleteness of the abundance file might be caused by some glitch in the gmap produced sam file so I've mapped the reads using STARlong, sorted and rerun the collapse and filer script again but I get the same erorr. Would you suggest any parameter change when running the collapse script in order to avoid this outcome? Maybe the -c 0.99 -i 0.95 are too strict since I've mapped cultivated potato transcripts to a reference which is a different subspecies - that might be why the gff has more isoforms than the abundance file.

from cdna_cupcake.

Magdoll commented on June 22, 2024

Hi @markopetek ,
Can you email me the input FASTQ file? I need it to run collapse so I can see if it's a bug in collapse. Whether you use GMAP or STAR or change parameters should have no effect.

from cdna_cupcake.

markopetek commented on June 22, 2024

Dear Liz, this are the fastq and the cluster report file that you've requested: https://www.dropbox.com/sh/kyn9j8txg6vqlnd/AAB9nLr8_GybALY5N1NKCo9Ia?dl=0 Let me know what you find out. With kind regards, Marko

…

-------------------- Lep pozdrav Marko Petek 2017-08-25 20:19 GMT+02:00 Magdoll <[email protected]>:

Hi @markopetek <https://github.com/markopetek> , Can you email me the input FASTQ file? I need it to run collapse so I can see if it's a bug in collapse. Whether you use GMAP or STAR or change parameters should have no effect. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#15 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMN1X0Dvz_3zvZZFrRDwTfXhZyC0R7ZJks5sbxA1gaJpZM4O_9m0> .

from cdna_cupcake.

Magdoll commented on June 22, 2024

Hi @markopetek ,

I believe I have identified the issue. Somehow some of your IDs had an extra underline in your FASTQ file, but they were not there in the cluster_report.csv (the cluster report is what is read to generate counts).

The group txt looked like this:

PB.1.1  i0_HQ__S2|c22385/f3p0/1042
PB.2.1  i0_HQ__S2|c218058/f3p12/1377
PB.2.2  i0_HQ__S2|c113604/f3p17/1237
PB.2.3  i0_HQ__S2|c356155/f2p6/825
PB.3.1  i0_HQ__S2|c212424/f2p1/1373
PB.3.2  i0_HQ__S2|c4103/f3p1/1275
PB.4.1  i1_HQ_S2|c114684/f2p18/3644
PB.4.2  i1_HQ_S2|c14710/f2p15/2321
PB.5.1  i1_HQ_S2|c161525/f14p22/3663

where you notice some of the IDs had two underlines __ instead of one before S2. This was consistent with the HQ fastq IDs but were inconsistent with the cluster_report.csv, which all had only a single underline:

cluster_id,read_id,read_type
i3_ICE_S2|c0,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/22583/37_16585_CCS,FL
i3_ICE_S2|c1,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/34221/37_14492_CCS,FL
i3_ICE_S2|c2,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/127040/12966_55_CCS,FL

(don't be alarmed to see _ICE_ instead of _HQ_ in cluster_report.csv; it's supposed to be that way because at the end of ICE, we don't know whether clusters become HQ or LQ; the script get_abundance_post_collapse.py understands this difference)

Once I changed the group file to remove the extra underline and ran get_abundance_post_collapse.py again, I got the right results and subsequently filter_away_subset.py worked as well.

I've put the fixed group file and filtered results here:
https://www.dropbox.com/s/ec7uhd3sbllenkb/fixed.tar.gz?dl=0

from cdna_cupcake.

markopetek commented on June 22, 2024

Thanks for the help Liz. I see that in the fastq file the sequences starting with @i0_HQ are followed by two underlines, those starting with @i1_HQ or @i2_HQ have only one. I'll rerun the gmap mapping in order to include the chloroplast and mitochondrial genomes besides the nuclear genome. I'll fix the input fastq before running the collapse script by replacing all @i0_HQ__S2 with @i0_HQ_S2. Is that OK or should I rather fix the group.txt file as you did?

from cdna_cupcake.

Magdoll commented on June 22, 2024

Fixing the HQ FASTQ file header itself is easier because that's the "root" of the problem. Fixing the group.txt is also fine, you just have to remember to do it every time you re-run collapse.

What's not clear to me is why for @i0_HQ there is two underlines. I've never seen it happen. Hopefully it's just a human error somewhere.
--Liz

from cdna_cupcake.

filter_away_subset.py error about cdna_cupcake HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs