GithubHelp home page GithubHelp logo

Comments (7)

Magdoll avatar Magdoll commented on June 22, 2024

Hi @markopetek ,

Your count file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt is somehow incomplete. It is missing, for example, the count information for PB.1.1, which is present in other collapse output such as all_sizes.quivered_hq_MPE_S2.fastq.collapsed.GFF.

The file all_sizes.quivered_hq_MPE_S2.fastq.collapsed.group.txt shows that you have 9703 isoforms, this means there should be 9703+15 (for header)=9718 lines in all_sizes.quivered_hq_MPE_S2.fastq.collapsed.abundance.txt but it has only 4339.

--Liz

from cdna_cupcake.

markopetek avatar markopetek commented on June 22, 2024

Hi Liz,

I thought this incompleteness of the abundance file might be caused by some glitch in the gmap produced sam file so I've mapped the reads using STARlong, sorted and rerun the collapse and filer script again but I get the same erorr. Would you suggest any parameter change when running the collapse script in order to avoid this outcome? Maybe the -c 0.99 -i 0.95 are too strict since I've mapped cultivated potato transcripts to a reference which is a different subspecies - that might be why the gff has more isoforms than the abundance file.

from cdna_cupcake.

Magdoll avatar Magdoll commented on June 22, 2024

Hi @markopetek ,
Can you email me the input FASTQ file? I need it to run collapse so I can see if it's a bug in collapse. Whether you use GMAP or STAR or change parameters should have no effect.

from cdna_cupcake.

markopetek avatar markopetek commented on June 22, 2024

from cdna_cupcake.

Magdoll avatar Magdoll commented on June 22, 2024

Hi @markopetek ,

I believe I have identified the issue. Somehow some of your IDs had an extra underline in your FASTQ file, but they were not there in the cluster_report.csv (the cluster report is what is read to generate counts).

The group txt looked like this:

PB.1.1  i0_HQ__S2|c22385/f3p0/1042
PB.2.1  i0_HQ__S2|c218058/f3p12/1377
PB.2.2  i0_HQ__S2|c113604/f3p17/1237
PB.2.3  i0_HQ__S2|c356155/f2p6/825
PB.3.1  i0_HQ__S2|c212424/f2p1/1373
PB.3.2  i0_HQ__S2|c4103/f3p1/1275
PB.4.1  i1_HQ_S2|c114684/f2p18/3644
PB.4.2  i1_HQ_S2|c14710/f2p15/2321
PB.5.1  i1_HQ_S2|c161525/f14p22/3663

where you notice some of the IDs had two underlines __ instead of one before S2. This was consistent with the HQ fastq IDs but were inconsistent with the cluster_report.csv, which all had only a single underline:

cluster_id,read_id,read_type
i3_ICE_S2|c0,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/22583/37_16585_CCS,FL
i3_ICE_S2|c1,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/34221/37_14492_CCS,FL
i3_ICE_S2|c2,m160923_181413_42165_c101102612550000001823258304261736_s1_p0/127040/12966_55_CCS,FL

(don't be alarmed to see _ICE_ instead of _HQ_ in cluster_report.csv; it's supposed to be that way because at the end of ICE, we don't know whether clusters become HQ or LQ; the script get_abundance_post_collapse.py understands this difference)

Once I changed the group file to remove the extra underline and ran get_abundance_post_collapse.py again, I got the right results and subsequently filter_away_subset.py worked as well.

I've put the fixed group file and filtered results here:
https://www.dropbox.com/s/ec7uhd3sbllenkb/fixed.tar.gz?dl=0

from cdna_cupcake.

markopetek avatar markopetek commented on June 22, 2024

Thanks for the help Liz. I see that in the fastq file the sequences starting with @i0_HQ are followed by two underlines, those starting with @i1_HQ or @i2_HQ have only one. I'll rerun the gmap mapping in order to include the chloroplast and mitochondrial genomes besides the nuclear genome. I'll fix the input fastq before running the collapse script by replacing all @i0_HQ__S2 with @i0_HQ_S2. Is that OK or should I rather fix the group.txt file as you did?

from cdna_cupcake.

Magdoll avatar Magdoll commented on June 22, 2024

Fixing the HQ FASTQ file header itself is easier because that's the "root" of the problem. Fixing the group.txt is also fine, you just have to remember to do it every time you re-run collapse.

What's not clear to me is why for @i0_HQ there is two underlines. I've never seen it happen. Hopefully it's just a human error somewhere.
--Liz

from cdna_cupcake.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.