
Comments (7)

Magdoll avatar Magdoll commented on July 2, 2024

Hi @XiaoyuZhan520 ,

If your HQ isoforms include entries with only 1 FL read, you are using an outdated version of Iso-Seq. We changed the HQ isoform criterion to a minimum of 2 FL reads (and a predicted accuracy of 99%) a long time ago. We raised the minimum to 2 precisely because many steps (transcription noise, library artifacts, sequencing artifacts) can introduce one-off errors. Using in-house data with control spike-ins, we observed that requiring 2 FL reads dramatically reduced false positives while still recovering all the true positives.

The rarefaction curve can be drawn at both the gene and the isoform level. Please see the wiki for details.

--Liz

from cdna_cupcake.

XiaoyuZhan520 avatar XiaoyuZhan520 commented on July 2, 2024

Hello Liz,

Thank you very much!

I processed the data with SMRT Link 4.0, and there is no parameter for setting the minimum number of FL reads.
I think I could set my own criterion to filter out HQ isoforms that are supported by only 1 FLNC read.

But I found that a lot of the data (around 60%) was filtered out. Do you think that is reasonable?


Magdoll avatar Magdoll commented on July 2, 2024

Hi @XiaoyuZhan520 ,
You can check the HQ/LQ sequence headers to see whether your HQ set includes sequences supported by only 1 FL read or only those with 2 or more. There are scripts that can be used for filtering, too.
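A minimal sketch of such a header check, for illustration only. It assumes the sequence IDs carry the FL count in the form `c<cluster>/f<FL>p<nFL>/<length>`; that layout is an assumption about older Iso-Seq output, so check it against your own files first. The names `fl_count` and `filter_fasta_by_fl` are hypothetical helpers, not part of cDNA_Cupcake.

```python
import re

# ASSUMPTION: headers look like ">c2/f3p1/980", where the number after "f"
# is the count of supporting FL reads. Adjust the pattern if yours differ.
_FL_PATTERN = re.compile(r"/f(\d+)p(\d+)/(\d+)")


def fl_count(header):
    """Return the FL read count encoded in a header, or None if absent."""
    match = _FL_PATTERN.search(header)
    return int(match.group(1)) if match else None


def filter_fasta_by_fl(lines, min_fl=2):
    """Yield FASTA lines of records supported by at least min_fl FL reads."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            count = fl_count(line)
            # Records whose header cannot be parsed are dropped.
            keep = count is not None and count >= min_fl
        if keep:
            yield line


# Toy records (made up for this example).
example = [
    ">c1/f1p0/1200", "ACGT",
    ">c2/f3p1/980", "GGCC", "TTAA",
]
kept = list(filter_fasta_by_fl(example, min_fl=2))
```

Here only the second toy record is kept, because the first one is supported by a single FL read.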

What do you mean by 60%? Do you mean that 60% of the consensus sequences are LQ, or that 60% of the FL reads end up as part of LQ? This depends a lot on sequencing depth and sample diversity. I see from before that you had only 50k FLNC reads, which is pretty low depth (a single Sequel cell these days would give you 250k FLNC reads), so it may not be surprising that a lot of them are supported by only a single FL read. Since it looks like you have a genome, you can probably allow 1-FL-read LQ sequences that map well. Use the collapse script with stringent coverage (-c) and identity (-i) cutoffs to help remove one-off artifacts.
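To make the idea of coverage and identity cutoffs concrete, here is a rough sketch. It is not the collapse script and may not match how that script defines the two values. The `Alignment` fields, the simplified formulas (gaps are ignored), and the thresholds 0.99 and 0.95 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one transcript-to-genome alignment."""
    query_id: str
    query_length: int   # length of the transcript sequence
    aligned_bases: int  # transcript bases included in the alignment
    matches: int        # aligned bases identical to the genome

    @property
    def coverage(self):
        # Fraction of the transcript that is aligned.
        return self.aligned_bases / self.query_length

    @property
    def identity(self):
        # Simplified: fraction of aligned bases that match; gaps are ignored.
        return self.matches / self.aligned_bases if self.aligned_bases else 0.0


def passes(aln, min_coverage=0.99, min_identity=0.95):
    """True if the alignment meets both (illustrative) cutoffs."""
    return aln.coverage >= min_coverage and aln.identity >= min_identity


# Toy alignments (made up for this example).
alignments = [
    Alignment("t1", 1000, 995, 990),   # well covered, high identity
    Alignment("t2", 1000, 800, 798),   # only 80% of the transcript aligned
    Alignment("t3", 1000, 1000, 900),  # fully aligned, but 90% identity
]
well_mapped = [a.query_id for a in alignments if passes(a)]
```

Only `t1` survives both cutoffs. A single-FL-read sequence that fails either one is more likely to be a one-off artifact than a real transcript.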

--Liz


XiaoyuZhan520 avatar XiaoyuZhan520 commented on July 2, 2024

Hello Liz,

Thanks for your help.

By 60% I mean that I have an HQ polished transcript dataset (FL >= 1 && accuracy > 0.99), and 60% of these HQ polished transcripts are supported by only 1 FLNC read.

I've found that most Iso-Seq datasets have this problem: the majority of transcripts after running cluster are supported by only 1 FLNC read.
Here is an example: https://academic.oup.com/gigascience/article/7/3/giy009/4860431

If I run the saturation analysis at the transcript level, it is also unsaturated.
I wonder whether this is a general feature of Iso-Seq data?


Magdoll avatar Magdoll commented on July 2, 2024

Hi @XiaoyuZhan520 ,

Older Iso-Seq runs done on the RS II, which had lower throughput, often were not sequenced deeply enough. Smaller and less complex organisms are easier to saturate.

You can take a look at this rarefaction collection for some ideas.

Saturation at the gene level and at the isoform level is different, too. It's possible to saturate at the gene level but still observe more transcripts, especially since Iso-Seq often recovers intron-retained transcripts as well.
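A toy sketch of how the two curves can diverge. It assumes each FL read has already been assigned a gene and an isoform label. The function `rarefaction_curve` is a hypothetical helper, and it uses a single random ordering of the reads, whereas a real rarefaction analysis would typically average over many subsamples.

```python
import random


def rarefaction_curve(read_labels, step, seed=0):
    """Return (reads_sampled, unique_labels_seen) points along one shuffle."""
    rng = random.Random(seed)
    shuffled = list(read_labels)
    rng.shuffle(shuffled)

    points = []
    seen = set()
    for i, label in enumerate(shuffled, start=1):
        seen.add(label)
        # Record a point every `step` reads, and always at the final read.
        if i % step == 0 or i == len(shuffled):
            points.append((i, len(seen)))
    return points


# Toy reads as (gene, isoform) pairs, made up for this example.
reads = [
    ("g1", "g1.1"), ("g1", "g1.2"), ("g2", "g2.1"),
    ("g1", "g1.3"), ("g2", "g2.2"), ("g1", "g1.1"),
]
gene_curve = rarefaction_curve([gene for gene, _ in reads], step=2)
isoform_curve = rarefaction_curve([iso for _, iso in reads], step=2)
```

In this toy set the gene curve can never rise above 2, while the isoform curve reaches 5 by the last read. That is the pattern described above: genes saturate while new isoforms keep appearing.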

It looks like you probably have older runs and lower sequencing depth as well. As I mentioned before, if you have a genome, you can use HQ/LQ sequences that map well to the genome. I just don't want artifacts to inflate the apparent diversity.

--Liz


XiaoyuZhan520 avatar XiaoyuZhan520 commented on July 2, 2024

Hello Liz,

Thanks for your files, they really help.
I find that a lot of cases are saturated at the gene level but not at the isoform level.

Have you evaluated how much data is enough for a species? For example, how much data is needed to reach saturation in isoform analysis for the case 'sequencing UHRR (Universal Human Reference RNA)'?


Magdoll avatar Magdoll commented on July 2, 2024

Hi @XiaoyuZhan520 ,
You can take a look at slide 35 from this slide deck for a UHRR saturation study. At 10 Sequel cells (or 2 million full-length reads) we are reaching the limit of detecting known genes within the captured size ranges.

--Liz
