Saturation analysis from isoform level about cdna_cupcake HOT 7 CLOSED

XiaoyuZhan520 commented on July 2, 2024

Saturation analysis from isoform level

from cdna_cupcake.

Comments (7)

Magdoll commented on July 2, 2024

Hi @XiaoyuZhan520 ,

If you have HQ isoforms with only 1 FL read, you are using an outdated version of Iso-Seq. We changed the HQ isoform criterion to a minimum of 2 FL reads (and a predicted accuracy of 99%) a long time ago. We raised the minimum to 2 precisely because many steps (transcription noise, library artifacts, sequencing artifacts) can introduce one-off errors. Using in-house data with control spike-ins, we observed that requiring 2 FL reads dramatically reduced false positives while still recovering all the true positives.

The rarefaction curve can be drawn at both the gene and the isoform level. Please see the wiki for details.

--Liz


XiaoyuZhan520 commented on July 2, 2024

Hello Liz,

Thank you very much!

I processed the data with SMRT Link 4.0, and there is no parameter for the minimum number of FL reads.
I think I could set my own criterion to filter out HQ isoforms that are supported by only 1 FLNC read.

But when I did, a lot of the data (around 60%) was filtered out. Do you think that is reasonable?


Magdoll commented on July 2, 2024

Hi @XiaoyuZhan520 ,
You can check the HQ/LQ sequence headers to confirm whether HQ contains sequences supported by only 1 FL read or only those supported by 2 or more. There are scripts that can be used for filtering, too.
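As a rough illustration of the header check, here is a minimal Python sketch. It assumes each FASTA header carries a `full_length_coverage=N` tag (for example `>c1/f1p0/900 isoform=c1;full_length_coverage=1;...`); that header layout is an assumption here, so adjust the pattern to whatever your SMRT Link version actually writes. The function names are hypothetical and not part of cDNA_Cupcake.

```python
import re

# Assumed tag inside each FASTA header; change the pattern if your headers differ.
FL_PATTERN = re.compile(r"full_length_coverage=(\d+)")


def fl_counts(fasta_lines):
    """Yield (sequence_id, FL read count) for every header that carries the tag."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        match = FL_PATTERN.search(line)
        if match:
            seq_id = line[1:].split()[0]
            yield seq_id, int(match.group(1))


def singleton_fraction(fasta_lines):
    """Fraction of tagged sequences supported by exactly one FL read."""
    counts = [n for _, n in fl_counts(fasta_lines)]
    if not counts:
        return 0.0
    return sum(1 for n in counts if n == 1) / len(counts)


# Toy input with two made-up records.
example = [
    ">c1/f1p0/900 isoform=c1;full_length_coverage=1;isoform_length=900",
    "ACGT",
    ">c2/f3p2/1200 isoform=c2;full_length_coverage=3;isoform_length=1200",
    "ACGT",
]
print(singleton_fraction(example))  # 0.5
```

Passing an open file handle instead of the toy list works the same way, since the functions only iterate over lines.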

What do you mean by 60%? Do you mean 60% of the consensus sequences are LQ, or 60% of the FL reads end up as part of LQ? This depends a lot on sequencing depth and sample diversity. I see from before that you had only 50k FLNC reads, which is pretty low depth (a single Sequel cell these days would give you 250k FLNC reads), so it may not be surprising that a lot of the sequences are supported by only a single FL read. Since it looks like you have a genome, you can probably allow 1-FL-read LQ sequences that map well. Use the collapse script with stringent coverage (-c) and identity (-i) cutoffs to help remove one-off artifacts.
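To make the idea of coverage and identity cutoffs concrete, here is a small Python sketch. It is not the collapse script's implementation: the `Alignment` record, the `passes` function, and the threshold values are all hypothetical, and coverage and identity are defined here in the simplest way (aligned query bases over query length, and matching bases over aligned bases).

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query_id: str
    query_length: int    # full length of the transcript sequence
    aligned_length: int  # number of query bases inside the alignment
    matches: int         # aligned bases identical to the genome


def passes(aln, min_coverage=0.99, min_identity=0.95):
    """Keep an alignment only if it clears both cutoffs (example thresholds)."""
    coverage = aln.aligned_length / aln.query_length
    identity = aln.matches / aln.aligned_length if aln.aligned_length else 0.0
    return coverage >= min_coverage and identity >= min_identity


# Made-up alignments: one good, one poorly covered, one with low identity.
alignments = [
    Alignment("good", 1000, 995, 990),
    Alignment("low_coverage", 1000, 900, 899),
    Alignment("low_identity", 1000, 1000, 940),
]
kept = [a.query_id for a in alignments if passes(a)]
print(kept)  # ['good']
```

Raising either cutoff trades sensitivity for fewer one-off artifacts, which matters most for sequences backed by a single FL read.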

--Liz


XiaoyuZhan520 commented on July 2, 2024

Hello Liz,

Thanks for your help.

By 60% I mean that I have an HQ polished transcript dataset (FL >= 1 and accuracy > 0.99), and 60% of these HQ polished transcripts are supported by only 1 FLNC read.

I've found that most Iso-Seq datasets have this problem: the majority of transcripts after running cluster are supported by only 1 FLNC read.
Here is an example: https://academic.oup.com/gigascience/article/7/3/giy009/4860431

If I run the saturation analysis at the transcript level, it is also unsaturated.
I wonder whether this is a general feature of Iso-Seq data?


Magdoll commented on July 2, 2024

Hi @XiaoyuZhan520 ,

Older Iso-Seq runs done on the RS II, which had lower throughput, were often not sequenced deeply enough. Smaller and less complex organisms are easier to saturate.

You can take a look at this rarefaction collection for some ideas.

Saturation at the gene level and at the isoform level are different too. It's possible to saturate at the gene level but still observe more transcripts, especially since Iso-Seq often recovers intron-retained transcripts as well.
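The difference between the two levels can be illustrated with a minimal rarefaction sketch in Python. This is not the cDNA_Cupcake workflow (the wiki covers that); the function, the toy read assignments, and the isoform and gene names below are all made up for illustration. The idea is simply to subsample FL reads at increasing depths and count how many distinct genes and isoforms are seen.

```python
import random


def rarefaction(read_to_isoform, isoform_to_gene, sizes, seed=0):
    """Return (sample size, genes seen, isoforms seen) for each subsample size."""
    rng = random.Random(seed)
    reads = list(read_to_isoform)
    points = []
    for size in sizes:
        sample = rng.sample(reads, min(size, len(reads)))
        isoforms = {read_to_isoform[r] for r in sample}
        genes = {isoform_to_gene[i] for i in isoforms}
        points.append((size, len(genes), len(isoforms)))
    return points


# Toy data: 100 FL reads, two genes, one gene with two rare isoforms.
isoform_to_gene = {"A.1": "A", "A.2": "A", "A.3": "A", "B.1": "B"}
assignments = ["A.1"] * 60 + ["B.1"] * 30 + ["A.2"] * 8 + ["A.3"] * 2
read_to_isoform = {f"read{i}": iso for i, iso in enumerate(assignments)}

for size, n_genes, n_isoforms in rarefaction(
    read_to_isoform, isoform_to_gene, sizes=[5, 10, 25, 50, 100]
):
    print(size, n_genes, n_isoforms)
```

With data shaped like this, the gene count tends to plateau at shallow depth while the isoform count keeps climbing as rare isoforms are sampled, which is the pattern described above.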

It looks like you probably have older runs and lower sequencing depth as well. As I mentioned before, if you have a genome, you can use HQ/LQ sequences that map well to the genome. I just don't want artifacts to inflate the apparent diversity.

--Liz


XiaoyuZhan520 commented on July 2, 2024

Hello Liz,

Thanks for your files, they really help.
I see that a lot of cases are saturated at the gene level but not at the isoform level.

Have you evaluated how much data is enough for a given species? For example, how much data is needed to saturate isoform-level analysis for a case like sequencing UHRR (Universal Human Reference RNA)?


Magdoll commented on July 2, 2024

Hi @XiaoyuZhan520 ,
You can take a look at slide 35 from this slide deck for a UHRR saturation study. At 10 Sequel cells (or 2 million full-length reads) we are reaching the limit of detecting known genes within the captured size ranges.

--Liz
