Comments (7)
Hi @XiaoyuZhan520 ,
If you are getting HQ isoforms supported by only 1 FL read, you are using an outdated version of Iso-Seq. We changed the HQ isoform criterion to a minimum of 2 FL reads (and a predicted accuracy of 99%) a long time ago. We raised the minimum to 2 precisely because many steps (transcription noise, library artifacts, sequencing artifacts) can introduce one-off errors. Using in-house data with control spike-ins, we observed that requiring 2 FL reads dramatically reduced false positives while still recovering all the true positives.
The rarefaction curve can be drawn at both the gene and the isoform level. Please see the wiki for details.
--Liz
from cdna_cupcake.
Hello Liz,
Thank you very much!
I processed the data with SMRTlink 4.0, and there is no parameter for setting the minimum number of FL reads.
I think I could set my own criterion to filter out HQ isoforms that are supported by only 1 FLNC read.
But when I did, a lot of the data (around 60%) was filtered out. Do you think that is reasonable?
Hi @XiaoyuZhan520 ,
You can check the HQ/LQ sequence headers to confirm HQ only has sequences supported by 1 or 2 FL reads. There are scripts that can be used for filtering, too.
What do you mean by 60%? Do you mean 60% of the consensus sequences are LQ, or 60% of the FL reads end up as part of LQ? This depends a lot on sequencing depth and sample diversity. I see you wrote earlier that you had only 50k FLNC reads, which is pretty low depth (a single Sequel cell these days would give you 250k FLNC reads), so it may not be surprising that a lot of them are supported by only a single FL read. Since it looks like you have a genome, you can probably allow 1-FL-read LQ sequences that map well. Use the collapse script with stringent coverage (-c) and identity (-i) cutoffs to help remove one-off artifacts.
--Liz
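The header check and FL-support filter discussed above could be sketched roughly as follows. This is only an illustration, not one of the Cupcake scripts: the `full_length_coverage=<N>` header field, the example headers, and the function names are assumptions made for the sketch, so check your own HQ/LQ headers and adapt the pattern before relying on it.

```python
import re

# ASSUMPTION: each header carries the FL read count as "full_length_coverage=<N>".
# Older Iso-Seq versions may encode it differently; adjust the pattern as needed.
FL_PATTERN = re.compile(r"full_length_coverage=(\d+)")


def fl_count(header):
    """Return the FL read count parsed from a header, or None if the field is absent."""
    match = FL_PATTERN.search(header)
    return int(match.group(1)) if match else None


def split_by_fl_support(headers, min_fl=2):
    """Partition headers into (kept, dropped) by FL support.

    Headers whose FL count cannot be parsed are dropped rather than guessed at.
    """
    kept, dropped = [], []
    for header in headers:
        count = fl_count(header)
        if count is not None and count >= min_fl:
            kept.append(header)
        else:
            dropped.append(header)
    return kept, dropped


# Hypothetical headers, for illustration only.
example_headers = [
    "transcript/0 full_length_coverage=5;length=2100",
    "transcript/1 full_length_coverage=1;length=1800",
    "transcript/2 full_length_coverage=2;length=950",
]
kept, dropped = split_by_fl_support(example_headers)
single_fl_fraction = len(dropped) / len(example_headers)
```

Running the same partition over all HQ headers gives the fraction of transcripts that have only single-read support, which is the number being asked about in this thread.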
Hello Liz,
Thanks for your help.
By 60% I mean that I have an HQ polished transcript dataset (FL>=1 && Accuracy > 0.99), and 60% of these HQ polished transcripts are supported by only 1 FLNC read.
I've found that most Iso-Seq datasets have this problem: the majority of transcripts after running cluster are supported by only 1 FLNC read.
Here is an example: https://academic.oup.com/gigascience/article/7/3/giy009/4860431
If I run the saturation analysis at the transcript level, it is also unsaturated.
I wonder whether this is a characteristic of Iso-Seq data?
Hi @XiaoyuZhan520 ,
Older Iso-Seq runs done on the RS II, which had lower throughput, often were not sequenced deeply enough. Smaller and less complex organisms are easier to saturate.
You can take a look at this rarefaction collection for some ideas.
Saturation at the gene vs isoform level is different too. It's possible to saturate the gene level but still observe more transcripts, esp since Iso-Seq often recovers intron retained transcripts as well.
It looks like you probably have older runs and lower sequencing depth as well. As I mentioned before, if you have a genome, you can use HQ/LQ sequences that map well to the genome. I just don't want contamination of artifacts to inflate the true diversity.
--Liz
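To illustrate why the gene-level curve can flatten while the isoform-level curve keeps rising, here is a minimal rarefaction sketch. It is not the Cupcake subsampling script: the input layout (one `(gene_id, isoform_id)` pair per FL read), the function name, and the toy data are assumptions made for the example.

```python
import random


def rarefaction(read_assignments, sizes, seed=0):
    """Count distinct genes and isoforms seen in nested subsamples of FL reads.

    read_assignments: list of (gene_id, isoform_id) tuples, one per FL read
                      (an assumed input layout, not a Cupcake file format).
    sizes:            subsample sizes to evaluate.
    Returns a list of (size, n_genes, n_isoforms), ordered by size.
    """
    shuffled = list(read_assignments)
    random.Random(seed).shuffle(shuffled)
    curve = []
    for size in sorted(sizes):
        # Prefixes of a single shuffle are nested, so the counts never decrease.
        subset = shuffled[:size]
        genes = {gene for gene, _ in subset}
        isoforms = {isoform for _, isoform in subset}
        curve.append((size, len(genes), len(isoforms)))
    return curve


# Toy data: 2 genes, 4 isoforms, 12 FL reads with uneven abundance.
reads = (
    [("G1", "G1.1")] * 6
    + [("G1", "G1.2")] * 2
    + [("G2", "G2.1")] * 3
    + [("G2", "G2.2")]
)
curve = rarefaction(reads, sizes=[3, 6, 12])
for size, n_genes, n_isoforms in curve:
    print(size, n_genes, n_isoforms)
```

A dataset is approaching saturation at a given level when the corresponding count stops growing as the subsample size increases. Rare isoforms, such as the single-read `G2.2` in the toy data, are the last to appear, which is why the isoform curve tends to saturate later than the gene curve.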
Hello Liz,
Thanks for your files, they really help.
I find that a lot of the cases are saturated at the gene level, but not at the isoform level.
Have you evaluated how much data is enough for a species? For example, how much data is needed to reach saturation at the isoform level in the case 'sequencing uhrr (universal human reference)'?
Hi @XiaoyuZhan520 ,
You can take a look at slide 35 from this slide deck for a UHRR saturation study. At 10 Sequel cells (or 2 million full-length reads) we are reaching the limit of detecting known genes within the captured size ranges.
--Liz